A Web-Centric Pipeline for Archiving Scholarly Artifacts
1. @mart1nkle1n @hvdsomp
TPDL2018, Porto, Portugal, 12 Sep 2018
Martin Klein
Los Alamos National Laboratory
@mart1nkle1n
https://orcid.org/0000-0003-0130-2097
Herbert Van de Sompel
Los Alamos National Laboratory
@hvdsomp
https://orcid.org/0000-0002-0715-6126
The Scholarly Orphans project
is funded by the Andrew W. Mellon Foundation
Research and Research Communication on the Web
• Consideration
  • Researchers are increasingly using a variety of web platforms for collaboration and communication
• Why?
  • Many of these platforms have desirable characteristics
    • Versioning
    • Time stamping
    • Social embedding
  • Their institutions do not provide platforms that have global reach
    • Collaboration, cf. GitHub ~ productivity
    • Communication, cf. SlideShare ~ visibility
Research and Research Communication on the Web
• Consideration
  • Researchers deposit artifacts in these web platforms
• Web platforms:
  • Dedicated to scholarship:
    • Commercial: e.g., FigShare, Publons
    • Not for profit: e.g., OSF, Zenodo
  • General purpose:
    • Commercial: e.g., GitHub, SlideShare
    • Not for profit: e.g., Wikipedia, Wikidata
Research and Research Communication on the Web
• Consideration
  • Researchers deposit artifacts in these web platforms
• Status quo - The researchers’ institutions commonly:
  • Do not know about the existence of these artifacts
  • Do not have a copy of these artifacts
Research and Research Communication on the Web
• Consideration
  • Researchers deposit artifacts in these web platforms
• Status quo - Uncertainty regarding long-term accessibility of these artifacts:
  • General purpose platforms don’t provide long-term access guarantees; platforms dedicated to scholarship commonly do
  • Uncertainty regarding the sustainability of unhindered long-term access to artifacts in these platforms:
    • Commercial: when is the change in business model coming?
    • Not for profit: will the next round of grant applications or member contributions be successful?
Research and Research Communication on the Web
• Consideration
  • Researchers deposit artifacts in these web platforms
• Status quo - These artifacts are not systematically archived:
  • No frameworks like LOCKSS/Portico exist for these artifacts
  • Researchers only selectively deposit artifacts in portals that provide archival guarantees, in order to obtain a citable DOI
  • Can’t expect researchers to (also) upload all artifacts to IRs
  • Web archives only incidentally archive these artifacts
    • Anecdotal & Hiberlink evidence
Hiberlink Evidence
Web resources referenced in Elsevier corpus (1996-2012)
without representative Memento in public web archives
Martin Klein, Herbert Van de Sompel, et al. (2014) Scholarly context not found. In: PLOS ONE
https://doi.org/10.1371/journal.pone.0115253
The Need for an Archiving Infrastructure
Herbert Van de Sompel & Andrew Treloar (2014) A Perspective on Archiving the Scholarly Web
https://hvdsomp.info/papers/Papers/2014/iPres2014_Sompel_Treloar.pdf
Recording versus Archiving
Recording                 Archiving
Short-term                Longer-term
No guarantees provided    Attempt to provide guarantees
Write many/read many      Write once/read many
Scholarly process         Scholarly record
Herbert Van de Sompel & Andrew Treloar (2014) A Perspective on Archiving the Scholarly Web
https://hvdsomp.info/papers/Papers/2014/iPres2014_Sompel_Treloar.pdf
The Scholarly Orphans Project
• Funded by the Andrew W. Mellon Foundation
  • Los Alamos National Laboratory & New Mexico Consortium
  • Old Dominion University
  • 04/2016 - 03/2019
• How to capture Scholarly Orphans (i.e., the scholarly artifacts deposited in web portals) for long-term archiving?
• Experimental project, aimed at exploring technical possibilities
The Scholarly Orphans Project
• Explores an institution-driven paradigm
  • Academic institutions typically have a long shelf life
    • A basic premise underlying, e.g., LOCKSS and perma.cc
  • An academic institution should be interested in capturing the artifacts (intellectual property) its scholars deposit on the web
    • Collecting and archiving such artifacts aligns with the mission of academic libraries
The Scholarly Orphans Project
• Explores a paradigm inspired by web archiving
  • Scale of the problem
    • Can’t expect researchers to upload all artifacts in an institutional repository
    • Bilateral agreements for archival purposes with most web portals unlikely
Inspiration
• LOCKSS
  • Web crawling approach
  • Focused on journal literature
• Archive-It
  • On-demand, subscription-based web archiving
  • Not focused on scholarly orphans
• Institutional repository, auto-discovery of journal articles
  • Capture an institution’s output
  • Focused on journal literature
• The Locker Project & Amy Guy’s Personal Web Observatory work
  • Capture an individual’s web presence
  • Not focused on scholarly orphans
http://rhiaro.co.uk/
https://rhiaro.github.io/thesis/
myresearch.institute - Researchers
• Uniquely identified by ORCIDs
• Web identities in multiple portals
• Create various types of artifacts
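Since researchers are keyed by ORCID, a tracker would plausibly validate each iD before using it. The following is a minimal sketch, assuming the ISO 7064 MOD 11-2 check character that ORCID documents for its identifiers; the function names are illustrative and not taken from the project.

```python
def orcid_check_digit(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Accept a bare iD or an https://orcid.org/ URI; hyphens are optional."""
    compact = orcid.rsplit("/", 1)[-1].replace("-", "").upper()
    if len(compact) != 16:
        return False
    base, check = compact[:15], compact[15]
    if not (base.isascii() and base.isdigit()):
        return False
    return orcid_check_digit(base) == check
```

Both ORCID URIs on the title slide pass this check, while changing the final character of either makes it fail.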
myresearch.institute - Portals
• Tracking started August 27, 2018
• Tracking artifacts created starting August 1, 2018
• >2,200 artifacts tracked to date for all 16 researchers
Tracking Artifacts - Description
• In order to track artifacts that were recently deposited by an institutional researcher in a portal, one reasonably needs:
  • The web identity of the researcher in the portal
    • Algorithmic discovery
    • Discovery via a registry
Algorithmic Discovery of Web Identities
James Powell, Harihar Shankar, Marko Rodriguez, and Herbert Van de Sompel (2014)
EgoSystem: Where are our alumni? In: code4lib http://journal.code4lib.org/articles/9519
Discovery of Web Identities via a Registry (ORCID)
Martin Klein and Herbert Van de Sompel (2017)
Discovering Scholarly Orphans Using ORCID. In: JCDL2017 https://arxiv.org/abs/1703.09343
Tracking Artifacts - Description
• In order to track artifacts that were recently deposited by an institutional researcher in a portal, one reasonably needs:
  • The web identity of the researcher in the portal
    • Algorithmic discovery
    • Discovery via a registry
  • A portal API that supports:
    • Access by web identity
    • Access to contributions “since …” for the web identity
• Result of tracking:
  • URI(s) of new artifact(s) discovered in the portal
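The tracking step described above can be sketched as a small polling function. This is an illustration under stated assumptions, not the project's tracker: `PortalClient`, `contributions_since`, and `InMemoryPortal` are hypothetical names, and the in-memory portal only stands in for a real portal API that offers access by web identity and "since" filtering.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable, List, Protocol, Tuple


class PortalClient(Protocol):
    """Hypothetical interface: what a tracker needs from any portal API."""

    def contributions_since(
        self, web_identity: str, since: datetime
    ) -> Iterable[Tuple[str, datetime]]:
        """Return (artifact URI, creation time) pairs for one web identity."""
        ...


@dataclass
class InMemoryPortal:
    """Stand-in for a real portal, used only to make the sketch runnable."""

    items: Dict[str, List[Tuple[str, datetime]]]

    def contributions_since(self, web_identity, since):
        return [(uri, ts) for uri, ts in self.items.get(web_identity, []) if ts > since]


def track(
    portal: PortalClient, web_identity: str, since: datetime
) -> Tuple[List[str], datetime]:
    """Return URIs of new artifacts plus the timestamp to use for the next poll."""
    found = sorted(portal.contributions_since(web_identity, since), key=lambda item: item[1])
    uris = [uri for uri, _ in found]
    next_since = found[-1][1] if found else since
    return uris, next_since
```

Returning the timestamp of the newest artifact lets the caller store it and pass it back as `since` on the next poll, so each artifact URI is handed to the capture step once.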
Tracking Artifacts - Challenges
• Discovery of web identities of researchers
  • Algorithmic and registry-based discovery are currently not adequate
  • Fallback: manual discovery and entry
    • With help of researcher
• Portal API access by web identity
  • Broadly supported by general purpose portals
  • Typically not supported by scholarly portals
    • Some lack an API altogether
    • Should add ORCID access to APIs
    • OAI-PMH and ResourceSync need sets per web identity
• Professional versus personal contributions
• Tracking frequency/scale
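To illustrate what "sets per web identity" could look like for OAI-PMH, here is a hedged sketch that builds a selective-harvest request. The `ListIdentifiers` verb and the `metadataPrefix`, `from`, and `set` arguments are part of OAI-PMH 2.0; the base URL and the `orcid:<iD>` set naming are assumptions made for this example, since no portal is known to expose such sets.

```python
from urllib.parse import urlencode


def oai_list_identifiers_url(
    base_url: str, set_spec: str, since: str, metadata_prefix: str = "oai_dc"
) -> str:
    """Build an OAI-PMH ListIdentifiers request limited to one set and start date."""
    params = {
        "verb": "ListIdentifiers",
        "metadataPrefix": metadata_prefix,
        "from": since,  # UTC date or datetime, e.g. 2018-08-01
        "set": set_spec,  # hypothetical per-researcher set, e.g. orcid:<iD>
    }
    return f"{base_url}?{urlencode(params)}"


url = oai_list_identifiers_url(
    "https://portal.example.org/oai",  # placeholder endpoint
    "orcid:0000-0003-0130-2097",
    "2018-08-01",
)
```

With a set per researcher, the `from` argument plays the role of the "since" filter that general purpose portal APIs already offer.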
Capturing Artifacts - Description
• The capture process takes as input the URI of a new artifact discovered in a portal
• Its task is to create a representative institutional capture of the artifact
• Result of capture:
  • WARC file for new artifact in an institutional archive
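To make the output of the capture step concrete, the sketch below serializes a single WARC/1.0 `resource` record using only the standard library. It shows the record layout only; a real capture would normally rely on a crawler or a dedicated WARC library and would also store the HTTP exchange, and the function names here are illustrative.

```python
import gzip
import uuid
from datetime import datetime, timezone
from typing import Iterable


def warc_resource_record(target_uri: str, payload: bytes, content_type: str) -> bytes:
    """Serialize one WARC/1.0 'resource' record for the given payload."""
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    # Header lines end with CRLF, a blank line separates them from the block,
    # and every record is terminated by two CRLFs.
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    return head + payload + b"\r\n\r\n"


def warc_gz_bytes(records: Iterable[bytes]) -> bytes:
    """Compress each record as its own gzip member, the usual .warc.gz layout."""
    return b"".join(gzip.compress(record) for record in records)
```

Compressing record by record keeps each record individually addressable by byte offset, which is what replay tools typically index.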
Capturing Artifacts - Description
• Challenges:
  • Delineate the web boundary of the artifact
    • More than the input artifact URI
    • The boundary is in the eye of the beholder
  • Create a high-fidelity capture using an approach that scales for a steady stream of new artifacts
    • Unsolved problem
Capturing Artifacts - Challenges
• Memento Tracer:
  • Language used to express Traces (interoperability)
  • Organization of the shared repository for Traces
  • Limitations of the browser event listener approach for recording Traces
  • Selection of a Trace for capturing a web publication by other means than URI pattern
• Legal constraints
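The slide implies that a Trace is currently selected by URI pattern. A minimal sketch of that kind of selection follows; the patterns, trace identifiers, and the `select_trace` function are invented for illustration and do not reflect Memento Tracer's actual Trace format or repository layout.

```python
import re
from typing import List, Optional, Tuple

# (URI pattern, trace identifier) pairs; both values are made up for illustration.
TRACES: List[Tuple[str, str]] = [
    (r"^https://github\.com/[^/]+/[^/]+/?$", "trace-github-repository"),
    (r"^https://www\.slideshare\.net/[^/]+/[^/]+/?$", "trace-slideshare-deck"),
]


def select_trace(uri: str, traces: List[Tuple[str, str]] = TRACES) -> Optional[str]:
    """Return the identifier of the first trace whose pattern matches the URI."""
    for pattern, trace_id in traces:
        if re.match(pattern, uri):
            return trace_id
    return None
```

A URI that matches no pattern yields `None`, which is the case the slide flags as a challenge: such publications need some other way of choosing a Trace.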
Archiving Artifacts - Description
• The archiving process takes as input the URI of a WARC file generated by the capture process
• Its task is to ingest the WARC file into a cross-institutional web archive
  • This can be achieved using off-the-shelf web archiving software, e.g., pywb, OpenWayback
• Result of archiving:
  • Mementos pertaining to the newly discovered artifact in a cross-institutional, Memento-compliant web archive
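"Memento-compliant" means the archive supports datetime negotiation as defined in RFC 7089: a client sends an `Accept-Datetime` header and is directed to the Memento closest to that time. The sketch below shows only the selection logic, assuming datetimes in the RFC 1123 format used by those headers; the function name and the archive URIs in the usage note are hypothetical.

```python
from email.utils import parsedate_to_datetime
from typing import Dict


def closest_memento(mementos: Dict[str, str], accept_datetime: str) -> str:
    """Pick the memento URI whose datetime is nearest to the requested one.

    `mementos` maps memento URIs to RFC 1123 datetimes ending in GMT.
    """
    wanted = parsedate_to_datetime(accept_datetime)
    return min(
        mementos,
        key=lambda uri: abs(parsedate_to_datetime(mementos[uri]) - wanted),
    )
```

For example, given captures dated `Wed, 01 Aug 2018 00:00:00 GMT` and `Mon, 10 Sep 2018 00:00:00 GMT`, a request for `Wed, 12 Sep 2018 12:00:00 GMT` selects the September capture.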
Archiving Artifacts - Challenges
• Attempted to use ipwb, a pywb version that uses IPFS
  • Cross-institutional distributed file system with redundancy
  • Ran out of time to get it operationally stable
Sawood Alam, Mat Kelly, and Michael L. Nelson (2016) InterPlanetary Wayback: The Permanent Web Archive
https://doi.org/10.1145/2910896.2925467
Summary (1/2)
• The Scholarly Orphans project explores an institution-driven approach to capture scholarly artifacts deposited in web portals
  • These artifacts are out of scope of existing archival approaches such as LOCKSS, Portico, and web archives
  • Institutions have a long shelf life, should be interested in collecting these artifacts, and have feasible scale for identity/artifact discovery
Summary (2/2)
• Components of the experimental pipeline:
  • Tracker: Automatically discover artifacts because researchers will not upload them to the institution
  • Capturer: High-fidelity artifact captures through crowd-sourcing navigation patterns with Memento Tracer
  • Archiver: Cross-institutional, Memento-compliant scholarly web archive
Acknowledgments
• Los Alamos National Laboratory:
  • Lyudmila Balakireva
  • Martin Klein
  • James Powell
  • Harihar Shankar
  • Herbert Van de Sompel
• Old Dominion University:
  • Sawood Alam
  • Grant Atkins
  • Shawn Jones
  • Mat Kelly
  • Michael L. Nelson
• myresearch.institute - all volunteering researchers