ResourceSync Tutorial
July 8, 2013, Open Repositories 2013, PEI, Canada
ResourceSync:
A Web-Based
Resource Synchronization
Framework
ResourceSync is funded by
The Sloan Foundation & JISC
#resourcesync
ResourceSync Tutorial History
• First outing: OAI8, Geneva, Switzerland, June 2013
• Second run: Open Repositories, here and now
• Most recent version of these tutorial slides is available at:
http://www.slideshare.net/OpenArchivesInitiative/resourcesync-tutorial
Presenter
Simeon Warner
Cornell University
simeon.warner@cornell.edu
@zimeon
Martin Klein
Los Alamos National Laboratory
<martinklein0815@gmail.com>
@mart1nkle1n
ResourceSync Tutorial Contributors
Simeon Warner
Cornell University
simeon.warner@cornell.edu
@zimeon
Herbert Van de Sompel
Los Alamos National Laboratory
<hvdsomp@gmail.com>
@hvdsomp
Robert Sanderson
Los Alamos National Laboratory
<azaroth24@gmail.com>
@azaroth24
Richard Jones
Cottage Labs
<richard@cottagelabs.com>
@cottagelabs
ResourceSync
Core Team
OAI
Herbert Van de Sompel
Martin Klein
Robert Sanderson
(Los Alamos National Laboratory)
Simeon Warner
(Cornell University)
Bernhard Haslhofer
(University of Vienna)
Michael L. Nelson
(Old Dominion University)
Carl Lagoze
(University of Michigan)
NISO
Todd Carpenter
Nettie Lagace
Lyrasis
Peter Murray
ResourceSync Technical Group
JISC
Richard Jones
Graham Klyne
Stuart Lewis
OCLC
Jeff Young
LOCKSS
David Rosenthal
RedHat
Christian Sadilek
Ex Libris Inc.
Shlomo Sanders
Library of Congress
Kevin Ford
ResourceSync - Agenda
1. ResourceSync: Problem Perspective & Conceptual
Approach
2. Motivation & Use Cases
3. Framework Walkthrough
4. Framework (Technical) Details
5. Implementation
6. Q&A
ResourceSync - Agenda
1. ResourceSync: Problem Perspective & Conceptual
Approach
Synchronize What?
• Web resources
o things with a URI that can be dereferenced
• Focus on needs of research communication and cultural heritage
organizations
o but aim for generality
• Small websites/repositories (a few resources) to large
repositories/datasets/linked data collections (many millions of
resources)
• Low change frequency (weeks/months) to high change
frequency (seconds)
• Synchronization latency and accuracy needs may vary
Why?
… because lots of projects and services are doing synchronization
but have to resort to ad-hoc, case by case, approaches!
• Project team involved with projects that need this
• Experience with OAI-PMH: widely used in repos but
o XML metadata only
o Attempts at synchronizing actual content via OAI-PMH
(complex object formats, dc:identifier) not successful.
o Web technology has moved on since 1999
• Devise a shared solution for data, metadata, linked data?
ResourceSync Problem
• Consideration:
o Source (server) A has resources that change over time: they
get created, modified, deleted
o Destination (servers) X, Y, and Z leverage (some)
resources of Source A.
• Problem:
o Destinations want to keep in step with the resource changes
at Source A: resource synchronization.
• Goal:
o Design an approach for resource synchronization aligned
with the Web Architecture that has a fair chance of adoption
by different communities.
o The approach must scale better than recurrent HTTP
HEAD/GET on resources.
Source: Four Core
Synchronization Capabilities
1. Describing content – publish a list of resources available for
synchronization to enable Destinations to perform an initial load
or catch-up with a Source
2. Packaging content – bundle resources to enable bulk download
for destinations
3. Describing changes – publish a list of resource changes to
enable destinations to stay synchronized and decrease latency
4. Packaging changes – bundle resource changes for bulk
download for destinations
Source: Synchronization Features
5. Linking to related resources – provide links from resources that
are subject to synchronization to related resources
- applicable to all core capabilities (1..4)
6. Access to historical data – provide archives of 1..4
7. Discovery of capabilities – support Destinations in discovering
all offered capabilities 1..4
Destination: Synchronization Needs
1. Baseline synchronization – A destination must be able to
perform an initial load or catch-up with a source
- avoid out-of-band setup
2. Incremental synchronization – A destination must have some
way to keep up-to-date with changes at a source
- subject to some latency; minimal: create/update/delete
- allow catching up after the destination has been offline
3. Audit – A destination should be able to determine whether it is
synchronized with a source
- subject to some latency
ResourceSync - Agenda
2. Motivation & Use Cases
Use Cases – The Basics
[Figures: use case diagrams a) to d)]
Use Cases – The not-so-Basics
[Figures: use case diagrams e) to h)]
Use Case 1: arXiv Mirroring and Data Sharing
• Repository of scholarly articles in physics,
mathematics, computer science, etc.
• > 850k articles
• approx. 1.5 revisions per article on
average
• approx. 75k new articles per year
• Each article has full-text and separate
metadata record
• approx. 3.8M resources
• 2,700 updates daily
o at 8pm EST
o Currently using homebrew mirroring
solution (running with minor
modifications since 1994!)
o occasional rsync (file system-specific,
auth issues)
Mirroring arXiv: 1994 - 2013
• Operated since the very early days of the
Web!
1. HTTP trigger from the main site
2. HTTP pull update specific to mirror
site
3. HTTP download of the resources
4. HTTP trigger to main site when mirror
process complete
5. HTTP verification (via HEAD) by the
main site which updates the update
list specific to mirror site
6. periodic repeat as long as there are
updates in the inventory for that
mirror
• Requires trusted set of servers operating
with the same internal organization
• Does not support synchronization check
(so rsync is used periodically)
Use Case 1: arXiv
Mirroring
• GOAL: Keep mirror sites synchronized with daily
changes
• WANT:
o high consistency
o moderate latency
o robustness to global network outages (low admin effort)
o ability to verify sync status in case of questions
Use Case 1: arXiv
Data Sharing
• GOAL: Make resources and update information
publicly available so that any other service may
synchronize at the frequency it needs, e.g.
o Math Front at UC Davis
o EprintWeb from IOP in UK
o Data for bibliometric and scientometric analysis
• WANT:
o low admin effort (i.e. standard approach, standard tools)
o reasonable consistency, latency, efficiency
Use Case 2: DBpedia Live Duplication
• Average of 2 updates per second
• Low latency desirable => need for a push technology
• Initial experiment with distributed infrastructure
• Daily traffic:
o 99% updates
o 0.6% deletions
o 0.03% creations
• Number of content transfer events in two 8-hour intervals
• Maximum queue size of the remote duplication process
ResourceSync - Agenda
3. Framework Walkthrough
Source Capability 1: Describing Content
In order to advertise the resources that a source wants destinations
to know about, it may describe them:
o Publish a Resource List, a list of resource URIs and possibly
associated metadata
- Destination GETs the Resource List
- Destination GETs listed resources by their URI
o Describes state of set of resources at one point in time
(snapshot)
Source Capability 2: Packaging Content
By default, content is transferred in response to a GET issued by a
destination against a URI of a source’s resource. But a source may
support additional mechanisms:
o Publish a Resource Dump, a document that points to
packages of resource representations and necessary
metadata
- Destination GETs the package
- Destination unpacks the package
- ZIP format supported
o Packages set of resources at one point in time (snapshot)
Source:
Modular Capabilities
Source Capability 3: Describing Changes
In order to achieve lower latency and/or greater efficiency, a source
may communicate about changes to its resources:
o Publish a Change List, a list of recent change events
(created, updated, deleted resource)
- Destination acts upon change events, e.g. GETs
created/updated resources, removes deleted resources.
o Describes changes to resources that occurred in a temporal
interval with a start- and an end-date
Source Capability 4: Packaging Changes
In order to reduce the number of requests to obtain resource
changes, a source may provide packaged bitstreams for changed
resources:
o Publish a Change Dump, a document that points to
packages of recently changed resource representations and
necessary metadata
- Destination GETs the package
- Destination unpacks the package
- ZIP format supported
o Packages resources that changed in a temporal interval with
a start- and an end-date
Source:
Modular Capabilities
Framework
Structure
(light)
Framework
Structure
(complete)
Destination: Key Processes
ResourceSync - Agenda
4. Framework (Technical) Details
1. Sitemaps
2. Pull method
3. Linking between resources
4. Discovery
5. Push method
6. Archives
ResourceSync - Agenda
4. Framework (Technical) Details
1. Sitemaps
So Many Choices
XMPP
AtomPub
SDShare
RSS
Atom
PubSubHubbub
Sitemap
XMPP
rsync
OAI-PMH
WebDAV Col. Syn.
OAI-ORE
DSNotify
RDFsync
Crawl
Push
Pull
SWORD
SPARQLpush
A Framework Based on Sitemaps
• Modular framework allowing selective deployment
• Sitemap is the core format throughout the framework
o Introduce extension elements and attributes:
- In ResourceSync namespace (rs:) to
accommodate synchronization needs
o Reuse Sitemap format for all capability documents:
Resource List, Resource Dump, Change List,
Change Dump, as well as for manifest in Dumps
o Utilize Sitemap index format where
needed/allowed
Sitemap Format
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://example.com/res1</loc>
<lastmod>2013-01-02T13:00:00Z</lastmod>
</url>
<url>
<loc>http://example.com/res2</loc>
<lastmod>2013-01-02T14:00:00Z</lastmod>
</url>
…
</urlset>
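A Destination can read this format with any XML parser. The following is a minimal sketch, assuming Python's standard library; the function name `read_sitemap` and the `SAMPLE` constant are illustrative and not part of the framework.

```python
import xml.etree.ElementTree as ET

# Namespace of the plain Sitemap format shown on the slide.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Same two entries as the slide, with the elided "…" left out.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2013-01-02T14:00:00Z</lastmod>
  </url>
</urlset>"""


def read_sitemap(xml_text):
    """Return a list of (loc, lastmod) pairs from a Sitemap document."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = url.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        entries.append((loc, lastmod))
    return entries


if __name__ == "__main__":
    for loc, lastmod in read_sitemap(SAMPLE):
        print(loc, lastmod)
```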
Sitemap Index Format
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>http://example.com/sitemap1.xml</loc>
<lastmod>2013-01-02T13:00:00Z</lastmod>
</sitemap>
<sitemap>
<loc>http://example.com/sitemap2.xml</loc>
<lastmod>2013-01-02T14:00:00Z</lastmod>
</sitemap>
…
</sitemapindex>
ResourceSync Sitemap Extensions
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:ln …/>
<rs:md …/>
<url>
<loc>http://example.com/res1</loc>
<lastmod>2013-01-02T13:00:00Z</lastmod>
<rs:ln …/>
<rs:md …/>
</url>
<url>
…
</url>
</urlset>
ResourceSync Sitemap Extensions
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:ln …/>
<rs:md …/>
<sitemap>
<loc>http://example.com/sitemap1.xml</loc>
<lastmod>2013-01-02T13:00:00Z</lastmod>
<rs:ln …/>
<rs:md …/>
</sitemap>
…
</sitemapindex>
ResourceSync - Agenda
4. Framework (Technical) Details
2. Pull method
Capability 1: Resource List
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:md capability="resourcelist"
from="2013-01-03T09:00:00Z"/>
<url>
<loc>http://example.com/res1</loc>
<lastmod>2013-01-02T13:00:00Z</lastmod>
<rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
length="8876"
type="text/html"/>
</url>
<url>
…
</url>
</urlset>
Resource List
• Describe Source’s resources that are subject to synchronization
• At one point in time (snapshot)
• Typical Destination use: Baseline Synchronization, Audit
• Each URI typically listed only once
• Might be expensive to generate
• Destinations use @from to determine freshness
• Issue GETs against URIs to obtain resources
• Very similar to current Sitemaps
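The baseline and audit uses listed above can be sketched as follows. This is a minimal illustration, assuming Python's standard library and an in-memory dictionary of local copies; `read_resource_list`, `audit`, and the sample body are hypothetical names and data introduced here, not framework APIs.

```python
import hashlib
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# Hypothetical resource body; its MD5 is computed so the sample is self-consistent.
BODY = b"<html>example</html>"

RESOURCE_LIST = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist" from="2013-01-03T09:00:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md hash="md5:{md5}" length="{length}" type="text/html"/>
  </url>
</urlset>""".format(md5=hashlib.md5(BODY).hexdigest(), length=len(BODY))


def read_resource_list(xml_text):
    """Return (snapshot time from @from, {uri: rs:md attributes})."""
    root = ET.fromstring(xml_text)
    top_md = root.find("rs:md", NS)
    snapshot = top_md.get("from") if top_md is not None else None
    resources = {}
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        md = url.find("rs:md", NS)
        resources[loc] = dict(md.attrib) if md is not None else {}
    return snapshot, resources


def audit(resources, local_copies):
    """Return URIs that are missing locally or whose MD5 digest differs.

    Entries without an md5 digest cannot be verified here and are not reported.
    """
    stale = []
    for uri, md in resources.items():
        body = local_copies.get(uri)
        if body is None:
            stale.append(uri)
            continue
        expected = md.get("hash", "")
        if expected.startswith("md5:"):
            if hashlib.md5(body).hexdigest() != expected[len("md5:"):]:
                stale.append(uri)
    return stale
```

A Destination would then issue GETs only for the URIs that `audit` returns, rather than for every listed resource.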
What if I have a million resources?
• Current sitemap limit is 50k resources (or maximum document
size of 50MB)
• Break complete list of resources into 50k-resource chunks, each
on a Resource List document
• Create a Resource List Index document to group them:
o Based on <sitemapindex>
o May have up to 50k component Resource Lists
o Extends capacity to 2,500,000,000 resources within current
community practices
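The arithmetic behind these figures can be checked with a short sketch, assuming Python; the helper names `chunk_uris` and `lists_needed` are illustrative only.

```python
# Limit stated on the slide: at most 50k entries per Sitemap document,
# and at most 50k component documents per index.
MAX_PER_DOCUMENT = 50_000


def chunk_uris(uris, size=MAX_PER_DOCUMENT):
    """Split a full list of URIs into chunks, one per Resource List document."""
    return [uris[i:i + size] for i in range(0, len(uris), size)]


def lists_needed(n_resources, size=MAX_PER_DOCUMENT):
    """Number of Resource List documents needed (ceiling division)."""
    return -(-n_resources // size)


# One index pointing at 50k lists of 50k resources each.
CAPACITY = MAX_PER_DOCUMENT * MAX_PER_DOCUMENT

if __name__ == "__main__":
    print(lists_needed(1_000_000))  # a million resources
    print(CAPACITY)
```

Under these limits a million resources need 20 Resource Lists under a single Resource List Index.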
Resource List Index <resourcelist_index.xml>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:md capability="resourcelist"
from="2013-01-02T09:00:02Z"/>
<sitemap>
<loc>http://example.com/resourcelist1.xml</loc>
<lastmod>2013-01-02T11:00:00Z</lastmod>
<rs:md type="application/xml"/>
</sitemap>
<sitemap>
<loc>http://example.com/resourcelist2.xml</loc>
<lastmod>2013-01-02T11:00:01Z</lastmod>
<rs:md type="application/xml"/>
</sitemap>
</sitemapindex>
Resource List <resourcelist1.xml>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:ln rel="up"
href="http://example.com/resourcelist_index.xml"/>
<rs:md capability="resourcelist"
from="2013-01-02T09:00:00Z"/>
<url>
<loc>http://example.com/res1</loc>
<lastmod>2013-01-02T08:07:06Z</lastmod>
<rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
length="8876"
type="text/html"/>
</url>
...
</urlset>
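The relationship between an index and its component lists can be followed in both directions. Below is a minimal sketch, assuming Python's standard library; `component_lists` and `up_link` are hypothetical helper names, and the two sample documents are trimmed versions of the ones above.

```python
import xml.etree.ElementTree as ET

SM = "http://www.sitemaps.org/schemas/sitemap/0.9"
NS = {"sm": SM, "rs": "http://www.openarchives.org/rs/terms/"}

INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist" from="2013-01-02T09:00:02Z"/>
  <sitemap><loc>http://example.com/resourcelist1.xml</loc></sitemap>
  <sitemap><loc>http://example.com/resourcelist2.xml</loc></sitemap>
</sitemapindex>"""

COMPONENT = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:ln rel="up" href="http://example.com/resourcelist_index.xml"/>
  <rs:md capability="resourcelist" from="2013-01-02T09:00:00Z"/>
  <url><loc>http://example.com/res1</loc></url>
</urlset>"""


def component_lists(index_xml):
    """Return the URIs of the component documents named by an index."""
    root = ET.fromstring(index_xml)
    if root.tag != "{%s}sitemapindex" % SM:
        raise ValueError("not a sitemap index document")
    return [s.findtext("sm:loc", namespaces=NS)
            for s in root.findall("sm:sitemap", NS)]


def up_link(list_xml):
    """Return the href of the document-level rel="up" link, or None."""
    root = ET.fromstring(list_xml)
    for ln in root.findall("rs:ln", NS):
        if ln.get("rel") == "up":
            return ln.get("href")
    return None
```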
Capability 2: Resource Dump
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:md capability="resourcedump"
from="2013-01-02T09:00:00Z"/>
<url>
<loc>http://example.com/resourcedump_part1.zip</loc>
<lastmod>2013-01-02T13:00:00Z</lastmod>
<rs:md length="97553"
type="application/zip"/>
</url>
<url>
<loc>http://example.com/resourcedump_part2.zip</loc>
<lastmod>2013-01-02T13:00:00Z</lastmod>
<rs:md length="21294"
type="application/zip"/>
</url>
</urlset>
Resource Dump Manifest
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:md capability="resourcedump-manifest"
from="2013-01-02T09:00:00Z"/>
<url>
<loc>http://example.com/res1</loc>
<lastmod>2013-01-02T13:00:00Z</lastmod>
<rs:md type="text/html"
path="/resources/res1"/>
</url>
<url>
<loc>http://example.com/res2</loc>
<lastmod>2013-01-02T13:00:00Z</lastmod>
<rs:md type="application/pdf"
path="/resources/res2"/>
</url>
</urlset>
Resource Dump
• Package Source’s resources that are subject to synchronization
• At one point in time (snapshot)
• Points to ZIP packages
• Mandatory, even for only one ZIP
• ZIP package contains manifest, listing contained bitstreams
• Typical Destination use: Baseline Synchronization, bulk
download
• Each URI typically listed only once
• Might be expensive to generate
• Destinations use @from to determine freshness
• GETs against individual URIs from Resource List achieves the
same result (ignoring varying freshness)
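Unpacking a dump package with the help of its manifest could look like the sketch below. It assumes Python's standard library, builds a tiny ZIP in memory so that it runs on its own, and assumes the manifest is stored in the package under the name `manifest.xml`; that file name, the helper names, and the sample bitstreams are assumptions for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

MANIFEST = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcedump-manifest" from="2013-01-02T09:00:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <rs:md type="text/html" path="/resources/res1"/>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <rs:md type="application/pdf" path="/resources/res2"/>
  </url>
</urlset>"""


def build_example_zip():
    """Create an in-memory ZIP standing in for a downloaded dump package."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("manifest.xml", MANIFEST)  # assumed manifest name
        zf.writestr("resources/res1", b"<html>res1</html>")
        zf.writestr("resources/res2", b"%PDF-res2")
    return buf.getvalue()


def unpack_dump(zip_bytes, manifest_name="manifest.xml"):
    """Map each resource URI in the manifest to its bitstream in the ZIP."""
    result = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        root = ET.fromstring(zf.read(manifest_name))
        for url in root.findall("sm:url", NS):
            loc = url.findtext("sm:loc", namespaces=NS)
            path = url.find("rs:md", NS).get("path")
            # ZIP member names carry no leading slash.
            result[loc] = zf.read(path.lstrip("/"))
    return result
```

The manifest is what ties each bitstream in the package back to the URI it represents at the Source.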
Capability 3: Change List
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:md capability="changelist"
from="2013-01-02T09:00:00Z"
until="2013-01-03T09:00:00Z"/>
<url>
<loc>http://example.com/res1</loc>
<lastmod>2013-01-02T13:00:00Z</lastmod>
<rs:md change="updated"
hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
length="8876"
type="text/html"/>
</url>
<url>
…
</url>
</urlset>
Change List
• Describe Source’s resource changes
• Occurring during temporal interval with start- and end-date
• Typical Destination use: Incremental Synchronization, Audit
• Changes are listed in chronological order
• Multiple changes to one URI may result in multiple listing of
same URI
• Source determines duration of temporal interval
• Destinations use @from and @until to determine freshness
• Issue GETs against URIs to obtain changed resources
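Because changes are listed in chronological order, a Destination can replay them one by one. The following is a minimal sketch, assuming Python's standard library and a local copy held in a dictionary; `read_changes`, `apply_changes`, and the `fetch` callable (standing in for an HTTP GET) are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# Illustrative Change List; res1 appears twice because it changed twice.
CHANGE_LIST = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist"
         from="2013-01-02T09:00:00Z" until="2013-01-03T09:00:00Z"/>
  <url><loc>http://example.com/res1</loc><rs:md change="updated"/></url>
  <url><loc>http://example.com/res3</loc><rs:md change="created"/></url>
  <url><loc>http://example.com/res2</loc><rs:md change="deleted"/></url>
  <url><loc>http://example.com/res1</loc><rs:md change="updated"/></url>
</urlset>"""


def read_changes(xml_text):
    """Return [(uri, change_type)] in document (chronological) order."""
    root = ET.fromstring(xml_text)
    changes = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        md = url.find("rs:md", NS)
        changes.append((loc, md.get("change") if md is not None else None))
    return changes


def apply_changes(local, changes, fetch):
    """Replay change events against a local {uri: bytes} store."""
    for uri, change in changes:
        if change in ("created", "updated"):
            local[uri] = fetch(uri)   # GET the current representation
        elif change == "deleted":
            local.pop(uri, None)      # remove the local copy if present
    return local
```

A Destination that only needs the final state could collapse repeated entries for the same URI and fetch each resource once; the sketch keeps every event to stay close to the list.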
Capability 4: Change Dump
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:md capability="changedump"
from="2013-01-02T09:00:00Z"
until="2013-01-03T09:00:00Z"/>
<url>
<loc>http://example.com/change_dump_part1.zip</loc>
<lastmod>2013-01-02T13:00:00Z</lastmod>
<rs:md length="887"
type="application/zip"/>
</url>
<url>
<loc>http://example.com/change_dump_part2.zip</loc>
<lastmod>2013-01-02T13:00:00Z</lastmod>
<rs:md length="9767"
type="application/zip"/>
</url>
</urlset>
Change Dump Manifest
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:md capability="changedump-manifest"
from="2013-01-02T09:00:00Z"
until="2013-01-03T09:00:00Z"/>
<url>
<loc>http://example.com/res1</loc>
<lastmod>2013-01-02T13:00:00Z</lastmod>
<rs:md change="updated"
length="2887"
type="text/html"
path="changes/res1"/>
</url>
<url>
…
</url>
</urlset>
Change Dump
• Package Source’s resources that have changed
• during temporal interval with start- and end-date
• Points to ZIP packages
• Mandatory, even for only one ZIP
• ZIP package contains manifest, listing contained bitstreams
• Typical Destination use: Incremental Synchronization, bulk
download of changes
• Changes in Change Dump Manifest listed in chronological order
• Same URI can be listed multiple times
• Might be expensive to generate
• Destinations use @from and @until to determine freshness
Recall… *Index
<changelist_index.xml>
<changelist1.xml>
Change List Index <changelist_index.xml>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:md capability="changelist"
from="2013-01-02T09:00:00Z"
until="2013-01-03T09:00:00Z"/>
<sitemap>
<loc>http://example.com/changelist1.xml</loc>
<lastmod>2013-01-02T11:00:00Z</lastmod>
<rs:md type="application/xml"/>
</sitemap>
<sitemap>
<loc>http://example.com/changelist2.xml</loc>
<lastmod>2013-01-02T23:00:00Z</lastmod>
<rs:md type="application/xml"/>
</sitemap>
</sitemapindex>
Change List <changelist1.xml>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:ln rel="up"
href="http://example.com/changelist_index.xml"/>
<rs:md capability="changelist"
from="2013-01-02T09:00:00Z"
until="2013-01-02T21:00:00Z"/>
<url>
<loc>http://example.com/res1</loc>
<lastmod>2013-01-02T13:00:00Z</lastmod>
<rs:md change="updated"
hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
length="8876"
type="text/html"/>
</url>
</urlset>
Resource Metadata Summary
Element/Attribute   Description                                             Defined by
<loc>               Resource URI (identity)                                 sitemaps
<lastmod>           Timestamp of last change                                sitemaps
<changefreq>        Expected update frequency                               sitemaps
<rs:md>                                                                     ResourceSync
  change            Change type (Change List & Change Dump Manifest only)   ResourceSync
  encoding          HTTP Content-Encoding header value                      RFC2616
  hash              One or more content digests (md5, sha-1, sha-256)       Atom Link Ext.
  length            HTTP Content-Length header value                        RFC4287
  path              Path in ZIP package (Dump Manifests only)               ResourceSync
  type              HTTP Content-Type header value                          RFC4287
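The hash attribute can be used to verify a downloaded representation. The sketch below assumes Python's standard library and assumes that several digests in one attribute value are separated by whitespace, each written as algorithm:hex, as in the md5 examples on the slides; the function names are illustrative.

```python
import hashlib

# Map the digest names from the table to hashlib algorithm names.
ALGORITHMS = {"md5": "md5", "sha-1": "sha1", "sha-256": "sha256"}


def parse_hash_attribute(value):
    """Split 'md5:... sha-256:...' into {'md5': '...', 'sha-256': '...'}."""
    digests = {}
    for token in value.split():
        name, _, hexdigest = token.partition(":")
        digests[name] = hexdigest.lower()
    return digests


def verify(content, hash_value):
    """Return True only if every recognized digest matches the content.

    Returns False when a digest mismatches or when no recognized digest
    is present, since nothing could be checked in that case.
    """
    checked = False
    for name, expected in parse_hash_attribute(hash_value).items():
        algorithm = ALGORITHMS.get(name)
        if algorithm is None:
            continue  # unknown digest type: skip it
        if hashlib.new(algorithm, content).hexdigest() != expected:
            return False
        checked = True
    return checked
```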
ResourceSync - Agenda
4. Framework (Technical) Details
3. Linking between resources
Supported Linking Use Cases
The web is based on links between resources, many of which are
important to understand for synchronization.
1. Mirrored content with multiple download locations
2. Alternate representations of the same content
3. Patching content rather than replacing
4. Resources and their metadata
5. Prior versions of resources
6. Collection membership of resources
7. Republishing synchronized resources
All cases are handled with a <rs:ln> element referring to the remote
resource
Notes about Linked Resources
Some important things to keep in mind about linked resources:
• They may also be subject to synchronization
• They may be updated on a very different schedule from the
resource that links to them
• Therefore, it is recommended to convey metadata about the
linked resource too
• Links can be bi-directional – the linked resource can link back to
the linking resource
Linking #1 - Mirror
1. Mirrored content with multiple download locations
This might occur due to:
• Content distribution networks
• Mirror sites
• Backup locations
• Load balancing
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:md capability="changelist"
from="2013-01-02T09:00:00Z"
until="2013-01-03T09:00:00Z"/>
<url>
<loc>http://example.com/res1</loc>
<lastmod>2013-01-02T13:00:00Z</lastmod>
<rs:md change="updated"/>
<rs:ln rel="duplicate"
pri="1"
href="http://mirror1.example.com/res1"/>
<rs:ln rel="duplicate"
pri="2"
href="http://mirror2.example.com/res1"/>
</url>
</urlset>
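A Destination could turn these links into an ordered list of download locations. The sketch below assumes Python's standard library and assumes that a lower pri value means a higher priority and that mirrors should be tried before the original location; both are assumptions made for illustration, as is the helper name `download_candidates`.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# Mirrors are deliberately listed out of priority order here.
MIRROR_ENTRY = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <url>
    <loc>http://example.com/res1</loc>
    <rs:md change="updated"/>
    <rs:ln rel="duplicate" pri="2" href="http://mirror2.example.com/res1"/>
    <rs:ln rel="duplicate" pri="1" href="http://mirror1.example.com/res1"/>
  </url>
</urlset>"""


def download_candidates(url_element):
    """Return mirror URIs sorted by pri, followed by the original URI."""
    loc = url_element.findtext("sm:loc", namespaces=NS)
    mirrors = []
    for ln in url_element.findall("rs:ln", NS):
        if ln.get("rel") == "duplicate":
            # Links without pri sort last (assumed default).
            mirrors.append((int(ln.get("pri", "999999")), ln.get("href")))
    mirrors.sort()
    return [href for _, href in mirrors] + [loc]


first_url = ET.fromstring(MIRROR_ENTRY).find("sm:url", NS)
```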
Linking #2 – Alternate Representations
2. Alternate representations of the same content
This might occur due to:
• Server supports HTTP content negotiation
• Multiple copies of the same resource
• Format migration for preservation reasons
• Different clients wanting different formats
• Multiple languages of the content
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:md capability="changelist"
from="2013-01-02T09:00:00Z"
until="2013-01-03T09:00:00Z"/>
<url>
<loc>http://example.com/res1</loc>
<lastmod>2013-01-02T13:00:00Z</lastmod>
<rs:md change="updated"/>
<rs:ln rel="alternate"
type="text/html"
href="http://example.com/res1.html"/>
<rs:ln rel="alternate"
type="application/pdf"
href="http://example.com/res1.pdf"/>
</url>
</urlset>
Linking #2 – Alternate Representations
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:md capability="changelist"
from="2013-01-02T09:00:00Z"
until="2013-01-03T09:00:00Z"/>
<url>
<loc>http://example.com/res1.html</loc>
<lastmod>2013-01-02T13:00:00Z</lastmod>
<rs:md change="updated"/>
<rs:ln rel="canonical"
href="http://example.com/res1"/>
</url>
</urlset>
Linking #3 – Patching Content
3. Patching content rather than replacing
This might occur due to:
• Resources are very large and server wishes to conserve
bandwidth where possible
• Changes are frequent and small
• Changes are managed in a CMS that tracks differences
• Format exists or can be described that is machine
processable to replicate the change
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:md capability="changelist"
from="2013-01-02T09:00:00Z"
until="2013-01-03T09:00:00Z"/>
<url>
<loc>http://example.com/res1.json</loc>
<lastmod>2013-01-02T13:00:00Z</lastmod>
<rs:md change="updated"
length="398723"/>
<rs:ln rel="http://www.openarchives.org/rs/terms/patch"
type="application/json-patch"
modified="2013-01-02T17:00:00Z"
length="58"
href="http://example.com/res1-patch.json"/>
</url>
</urlset>
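Whether to fetch the patch or the full resource is a Destination decision. The sketch below shows one possible policy, assuming Python's standard library: use the patch only when a local copy exists and the patch is smaller than the full representation. The policy and the helper name `plan_update` are assumptions for illustration; applying the patch itself is out of scope here.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}
PATCH_REL = "http://www.openarchives.org/rs/terms/patch"

PATCH_ENTRY = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <url>
    <loc>http://example.com/res1.json</loc>
    <rs:md change="updated" length="398723"/>
    <rs:ln rel="http://www.openarchives.org/rs/terms/patch"
           type="application/json-patch"
           modified="2013-01-02T17:00:00Z"
           length="58"
           href="http://example.com/res1-patch.json"/>
  </url>
</urlset>"""


def plan_update(url_element, have_local_copy):
    """Return ('patch', href, media_type) or ('full', uri, None)."""
    loc = url_element.findtext("sm:loc", namespaces=NS)
    md = url_element.find("rs:md", NS)
    full_length = int(md.get("length")) if md is not None and md.get("length") else None
    for ln in url_element.findall("rs:ln", NS):
        if ln.get("rel") != PATCH_REL or not have_local_copy:
            continue
        patch_length = int(ln.get("length")) if ln.get("length") else None
        # Prefer the patch only when both sizes are known and the patch is smaller.
        if full_length is not None and patch_length is not None and patch_length < full_length:
            return ("patch", ln.get("href"), ln.get("type"))
    return ("full", loc, None)


entry = ET.fromstring(PATCH_ENTRY).find("sm:url", NS)
```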
Linking #4 – Metadata about Resources
4. Resources and their metadata
This might occur due to:
• Resources have additional metadata records, which are
useful for understanding the resource
• Such as cultural heritage images, audio, video
• Collections with descriptive metadata
• Resources with technical metadata
• Administrative or Rights metadata
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:md capability="changelist"
from="2013-01-02T09:00:00Z"
until="2013-01-03T09:00:00Z"/>
<url>
<loc>http://example.com/res1</loc>
<lastmod>2013-01-02T13:00:00Z</lastmod>
<rs:md change="updated"/>
<rs:ln rel="describedby"
type="application/xml"
href="http://example.com/metadata/res1.xml"/>
</url>
</urlset>
Linking #4 – Metadata about Resources
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:md capability="changelist"
from="2013-01-02T09:00:00Z"
until="2013-01-03T09:00:00Z"/>
<url>
<loc>http://example.com/metadata/res1.xml</loc>
<lastmod>2013-01-02T13:00:00Z</lastmod>
<rs:md change="updated"/>
<rs:ln rel="describes"
type="text/html"
href="http://example.com/res1"/>
</url>
</urlset>
Linking #5 – Prior Versions of Resources
But first…
Memento Intermezzo
http://www.mementoweb.org/
URI for Original (URI-R), URI for Version (URI-M)
Web Archive
URI-R - http://www.cnn.com/
URI-M - http://web.archive.org/web/20010911203610/http://www.cnn.com/
CMS
URI-R - http://en.wikipedia.org/wiki/September_11_attacks
URI-M - http://en.wikipedia.org/w/index.php?title=September_11_attacks&oldid=282333
Linking #5 – Prior Versions of Resources
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:md capability="changelist"
from="2013-01-02T09:00:00Z"
until="2013-01-03T09:00:00Z"/>
<url>
<loc>http://example.com/res1</loc>
<lastmod>2013-01-02T13:00:00Z</lastmod>
<rs:md change="updated"/>
<rs:ln rel="memento"
href="http://example.com/past/20130102130000/res1"/>
<rs:ln rel="timegate"
href="http://example.com/timegate/res1"/>
<rs:ln rel="timemap"
href="http://example.com/timemap/res1"
type="application/link-format"/>
</url>
</urlset>
Linking #6 – Collection Membership
6. Collection membership of resources
A source might want to express this because:
• Resources are part of OAI-ORE aggregations
• Resources are part of OAI-PMH sets
• Or to indicate any other type of collection of resources
Collections are named with URIs and can then be linked to with
rel="collection"
• Nice if the collection URI resolves to a useful description
Linking #6 – Collection Membership
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:md capability="changelist"
from="2013-01-02T09:00:00Z"
until="2013-01-03T09:00:00Z"/>
<url>
<loc>http://example.com/res1</loc>
<lastmod>2013-01-02T13:00:00Z</lastmod>
<rs:md change="updated"/>
<rs:ln rel="collection"
href="http://example.com/aggregation/allres"/>
</url>
</urlset>
Linking #7 – Republishing Resources
7. Republishing synchronized resources
This might occur due to:
• Aggregator systems that harvest resources from remote sites
and then republish them at new URIs
• Examples include Blog republishing, content distribution
networks, mirrored or combined collections
• Hypothetical scenario: Lots of little museums with small
collections, and a large European/American aggregating
digital library system that wants to provide fast, combined
access to the content (with permission)
Linking #7 – Republishing Resources
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:md capability="changelist"
from="2013-01-02T09:00:00Z"
until="2013-01-03T09:00:00Z"/>
<url>
<loc>http://example.com/res1</loc>
<lastmod>2013-01-02T13:00:00Z</lastmod>
<rs:md change="updated"/>
<rs:ln rel="via"
modified="2013-01-02T10:00:00Z"
href="http://original.example.org/res1"/>
</url>
</urlset>
Linking #7 – Republishing Resources
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:md capability="changelist"
from="2013-01-02T09:00:00Z"
until="2013-01-03T09:00:00Z"/>
<url>
<loc>http://aggregator.example.com/res1</loc>
<lastmod>2013-01-02T18:00:00Z</lastmod>
<rs:md change="updated"/>
<rs:ln rel="via"
modified="2013-01-02T13:00:00Z"
href="http://example.org/res1"/>
<rs:ln rel="via"
modified="2013-01-02T10:00:00Z"
href="http://original.example.org/res1"/>
</url>
</urlset>
Link Relation Summary
Relation | Use in ResourceSync | Defined in
rel="alternate" | Link from generic to specific URI | HTML 5
rel="canonical" | Link from specific to generic URI | RFC6596
rel="collection" | Resource is member of collection | RFC6573
rel="describedby" | Has metadata | Protocol for Web Description Resources (POWDER): Description Resources
rel="describes" | Is metadata for | The 'describes' Link Relation Type
rel="duplicate" | Mirror or alternative copy | RFC6249
rel=".../rs/terms/patch" | A patch -- efficient change information | This specification
rel="memento" | Link to time-specific URI | Memento Internet Draft
rel="timegate" | Link to timegate | Memento Internet Draft
rel="via" | Provenance chain, came from | RFC4287
Related Resource Metadata Summary
• Attributes of the <rs:ln> element; cf. resource metadata, plus pri
Element/Attribute | Description | Defined by
<rs:ln> | | ResourceSync
encoding | HTTP Content-Encoding header value | RFC2616
hash | One or more content digests (md5, sha-1, sha-256) | Atom Link Ext.
href | Related resource URI (identity) | RFC4287
length | HTTP Content-Length header value | RFC4287
modified | Timestamp of last change (cf. <lastmod>) | Atom Link Ext.
path | Path in ZIP package (Dump Manifests only) | ResourceSync
pri | Priority of link | RFC6249
rel | Relation - IANA registered or URI | RFC4287
type | HTTP Content-Type header value | RFC4287
ResourceSync - Agenda
4. Framework (Technical) Details
4. Discovery
Discovery of
Capabilities
Discovery of Capability Documents
Requirements:
• Need to discover capability documents, i.e. Resource List,
Resource Dump, Change List, Change Dump, Archives
• Need to know the type of capability each document
represents.
Approach:
• The Capability List provides links to these capability documents,
if the Source supports them.
• These links have appropriate relation types, e.g.
“resourcelist”, “changelist”, etc.
Capability List
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:md capability="capabilitylist"/>
<rs:ln rel="resourcesync"
href="http://example.com/.well-known/resourcesync"/>
<url>
<loc>http://aggregator.example.com/dataset1/resourcelist.xml</loc>
<rs:md capability="resourcelist"/>
</url>
<url>
<loc>http://aggregator.example.com/dataset1/changelist.xml</loc>
<rs:md capability="changelist"/>
</url>
</urlset>
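A Destination could read a Capability List along these lines. This is a minimal sketch using only the Python 3 standard library; the function name `capabilities` and the embedded sample document are illustrative assumptions, not part of the ResourceSync client.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes used for lookups (the prefixes themselves are arbitrary).
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# Sample document modelled on the Capability List slide.
CAPABILITY_LIST = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="capabilitylist"/>
  <url>
    <loc>http://aggregator.example.com/dataset1/resourcelist.xml</loc>
    <rs:md capability="resourcelist"/>
  </url>
  <url>
    <loc>http://aggregator.example.com/dataset1/changelist.xml</loc>
    <rs:md capability="changelist"/>
  </url>
</urlset>
"""


def capabilities(xml_text):
    """Map each advertised capability name to the URI of its document."""
    root = ET.fromstring(xml_text.strip())
    found = {}
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        md = url.find("rs:md", NS)
        if loc and md is not None and md.get("capability"):
            found[md.get("capability")] = loc.strip()
    return found


if __name__ == "__main__":
    for name, uri in capabilities(CAPABILITY_LIST).items():
        print(name, uri)
```

The capability name on each `<rs:md>` tells the Destination what kind of document it will find at the `<loc>` URI before fetching it.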
Discovery of Capability Lists
Requirements:
• Need to discover a Capability List
Approach:
• HTTP Link header from resources subject to synchronization,
relation type “resourcesync”
• Links from HTML document <head>, relation type “resourcesync”
• Links from Capability documents, relation type “up”
Link header on http://example.com/res1.pdf
Link: <http://example.com/dataset1/capabilitylist.xml>; rel="resourcesync"
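The HTTP Link header route can be sketched as follows. This is a simplified parser written for illustration (the function name `find_links` is an assumption); it handles the common `<uri>; rel="..."` form but not every corner of the Link header grammar, such as a `<` inside a quoted parameter.

```python
import re


def find_links(header_value, rel):
    """Return the target URIs in a Link header whose rel includes `rel`."""
    targets = []
    # Each link-value starts with <uri>; its parameters run until the next "<".
    for match in re.finditer(r"<([^>]*)>([^<]*)", header_value):
        uri, params = match.group(1), match.group(2)
        rel_match = re.search(r'\brel\s*=\s*"?([^";,]+)"?', params)
        # A rel parameter may carry several space-separated relation types.
        if rel_match and rel in rel_match.group(1).split():
            targets.append(uri)
    return targets


if __name__ == "__main__":
    header = '<http://example.com/dataset1/capabilitylist.xml>; rel="resourcesync"'
    print(find_links(header, "resourcesync"))
```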
Discovery of
Capabilities
Discovery: ResourceSync Description
Requirements:
• Support for multiple Capability Lists, one per “set of
resources”
• Need to discover these Capability Lists
• Need descriptive information about each set of resources
that a Capability List pertains to
• Useful to have descriptive information about the Source itself
Approach:
• The ResourceSync Description document meets these
requirements.
• It should be at a particular location to avoid having registries:
http://(hostname)/.well-known/resourcesync
• It can be linked to from the Capability Lists as well.
Discovery of
Capabilities
ResourceSync Description
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:md capability="resourcesync"/>
<rs:ln rel="describedby"
href="http://example.com/info_about_source.xml"/>
<url>
<loc>http://aggregator.example.com/dataset1/capabilitylist.xml</loc>
<rs:md capability="capabilitylist"/>
<rs:ln rel="describedby"
href="http://example.com/dataset1/info_about_dataset1.xml"/>
</url>
<url>
<loc>http://aggregator.example.com/dataset2/capabilitylist.xml</loc>
<rs:md capability="capabilitylist"/>
<rs:ln rel="describedby"
href="http://example.com/dataset2/info_about_dataset2.xml"/>
</url>
</urlset>
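Discovery from a hostname alone might look like the sketch below: build the well-known URI, then list the Capability Lists found in the ResourceSync Description together with their "describedby" links. The function names and the sample document are assumptions for illustration, and fetching the document over HTTP is left out.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlunsplit

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# Sample document modelled on the ResourceSync Description slide.
DESCRIPTION = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcesync"/>
  <url>
    <loc>http://aggregator.example.com/dataset1/capabilitylist.xml</loc>
    <rs:md capability="capabilitylist"/>
    <rs:ln rel="describedby"
           href="http://example.com/dataset1/info_about_dataset1.xml"/>
  </url>
  <url>
    <loc>http://aggregator.example.com/dataset2/capabilitylist.xml</loc>
    <rs:md capability="capabilitylist"/>
    <rs:ln rel="describedby"
           href="http://example.com/dataset2/info_about_dataset2.xml"/>
  </url>
</urlset>
"""


def well_known_uri(hostname, scheme="http"):
    """Build the fixed location of the ResourceSync Description for a host."""
    return urlunsplit((scheme, hostname, "/.well-known/resourcesync", "", ""))


def capability_lists(xml_text):
    """Return (capability list URI, [describedby hrefs]) for each set of resources."""
    root = ET.fromstring(xml_text.strip())
    result = []
    for url in root.findall("sm:url", NS):
        md = url.find("rs:md", NS)
        if md is None or md.get("capability") != "capabilitylist":
            continue
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        described_by = [
            ln.get("href")
            for ln in url.findall("rs:ln", NS)
            if ln.get("rel") == "describedby"
        ]
        result.append((loc, described_by))
    return result


if __name__ == "__main__":
    print(well_known_uri("example.com"))
    for loc, info in capability_lists(DESCRIPTION):
        print(loc, info)
```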
Discovery of
Capabilities
ResourceSync - Agenda
4. Framework (Technical) Details
5. Push method
Motivation for a Push Component in
ResourceSync
• Reduce synchronization latency by having the Source push out
resource change information
o To avoid continuous pull of Change Lists by Destinations
• Share information about changes to the Source’s
ResourceSync implementation, e.g. announcement of new
Resource List, new Capability List, etc.
o To avoid continuous polling of e.g. Resource Lists,
ResourceSync Description
Notification Types
• Events pertaining to a resource
o updated | created | deleted for a resource
o 3rd party defined events
• Events pertaining to a set of resources
o updated | created | deleted for a Resource List, Resource
Dump, Change List, Change Dump, Archives
o 3rd party defined events
• Events pertaining to the overall ResourceSync implementation
o updated | created | deleted for a Capability List,
ResourceSync Description
o 3rd party defined events
Possible Push Technology: XMPP PubSub
Other technologies: WebSockets, HTTP callback
Notification Payload
• Payload the same irrespective of transport protocol
• Use <urlset> as encapsulating element
• One <url> element per notification
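A transport-independent payload of this shape could be built as in the sketch below. The helper name `notification` and the `sm`/`rs` prefix choice are assumptions for illustration; wrapping the result in an XMPP stanza or other transport envelope is left out.

```python
import xml.etree.ElementTree as ET

SM = "http://www.sitemaps.org/schemas/sitemap/0.9"
RS = "http://www.openarchives.org/rs/terms/"

# Ask ElementTree to serialize with readable prefixes instead of ns0, ns1.
ET.register_namespace("sm", SM)
ET.register_namespace("rs", RS)


def notification(loc, lastmod, **md_attrs):
    """Build a <urlset> holding a single <url> that describes one event.

    md_attrs become attributes of <rs:md>, e.g. change="updated".
    """
    urlset = ET.Element("{%s}urlset" % SM)
    url = ET.SubElement(urlset, "{%s}url" % SM)
    ET.SubElement(url, "{%s}loc" % SM).text = loc
    ET.SubElement(url, "{%s}lastmod" % SM).text = lastmod
    ET.SubElement(url, "{%s}md" % RS, md_attrs)
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    print(notification(
        "http://example.com/res1",
        "2013-01-02T14:00:00Z",
        change="updated",
        length="987665",
        type="application/pdf",
    ))
```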
Notification Payload – Resource Update (XMPP)
<xmpp:iq from="sender@example.com" to="destination@example.org"
type="set" id="liAJUz3S">
<xmpp:pubsub>
<xmpp:publish node="resource_notification_channel">
<xmpp:item id="1234577">
<sm:urlset xmlns:sm="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<sm:url>
<sm:loc>http://example.com/res1</sm:loc>
<sm:lastmod>2013-01-02T14:00:00Z</sm:lastmod>
<rs:md change="updated" hash="md5:12324324jhhjl234234"
length="987665" type="application/pdf"/>
</sm:url>
</sm:urlset>
</xmpp:item>
</xmpp:publish>
</xmpp:pubsub>
</xmpp:iq>
Notification Payload – Capability Update (XMPP)
<xmpp:iq from="sender@example.com" to="destination@example.com"
type="set" id="liAJUz3S">
<xmpp:pubsub>
<xmpp:publish node="changelist_notification_channel">
<xmpp:item id="1234577">
<sm:urlset xmlns:sm="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<sm:url>
<sm:loc>http://example.com/dataset1/changelist.xml</sm:loc>
<sm:lastmod>2013-01-02T14:00:00Z</sm:lastmod>
<rs:md capability="changelist" change="updated"/>
</sm:url>
</sm:urlset>
</xmpp:item>
</xmpp:publish>
</xmpp:pubsub>
</xmpp:iq>
Push Technology Considerations
• Notification channels
o Multiple channels per Source to divide up notifications, e.g.
- a channel for changes pertaining to all resources that
belong to a set of resources
- a channel for changes to capabilities for a set of
resources
o Server-side filtering preferred over client-side
• Authentication/Authorization
o To subscribe/create channels
• Delayed notification
o Insurance that Destination does not miss anything
• Discovery
o Links to channels e.g. from a Capability List
o Links from channels to other channels
o Provide channel metadata (transport protocol info etc.)
Push Channel Discovery
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
…
<url>
<loc>xmpp:pubsub.example.com/dataset1?;node=resource_notification_channel</loc>
<rs:md capability="resource-notification"/>
<rs:ln rel="alternate"
href="ws://example.com/dataset1/meta_notification_channel"/>
</url>
<url>
<loc>xmpp:pubsub.example.com/dataset1?;node=capability_notification_channel</loc>
<rs:md capability="capability-notification"/>
</url>
<url>
<loc>xmpp:pubsub.example.com/dataset1?;node=resourcesync_notification_channel</loc>
…
ResourceSync - Agenda
4. Framework (Technical) Details
6. Archives
ResourceSync Framework Component:
Archives
In order to allow a Source to hold on to historical data and
Destinations to catch up with events they have missed:
o Publish a
- Resource List Archive,
- Resource Dump Archive,
- Change List Archive, and/or a
- Change Dump Archive
o These are documents listing historical capability documents
Resource List Archive
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:md capability="resourcelist-archive"
from="2013-01-09T13:00:00Z"/>
<url>
<loc>http://example.com/resourcelist1.xml</loc>
<lastmod>2013-01-02T13:00:00Z</lastmod>
</url>
<url>
<loc>http://example.com/resourcelist2.xml</loc>
<lastmod>2013-01-09T13:00:00Z</lastmod>
</url>
<url>
…
</url>
</urlset>
Resource Dump Archive
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:md capability="resourcedump-archive"
from="2013-02-10T03:00:00Z"/>
<url>
<loc>http://example.com/resourcedump1.xml</loc>
<lastmod>2013-01-10T03:00:00Z</lastmod>
</url>
<url>
<loc>http://example.com/resourcedump2.xml</loc>
<lastmod>2013-02-10T03:00:00Z</lastmod>
</url>
<url>
…
</url>
</urlset>
Change List Archive
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:md capability="changelist-archive"
from="2013-02-01T23:00:00Z"
until="2013-02-03T23:00:00Z"/>
<url>
<loc>http://example.com/changelist1.xml</loc>
<lastmod>2013-02-01T23:00:00Z</lastmod>
</url>
<url>
<loc>http://example.com/changelist2.xml</loc>
<lastmod>2013-02-02T23:00:00Z</lastmod>
</url>
<url>
<loc>http://example.com/changelist3.xml</loc>
<lastmod>2013-02-03T23:00:00Z</lastmod>
</url>
</urlset>
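A Destination that has been offline could use a Change List Archive to work out which Change Lists it still needs. The sketch below assumes that a Change List whose `<lastmod>` is later than the Destination's last synchronization time may contain missed changes; that selection rule, the function names, and the sample document are illustrative assumptions.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# Sample document modelled on the Change List Archive slide.
ARCHIVE = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist-archive"
         from="2013-02-01T23:00:00Z"
         until="2013-02-03T23:00:00Z"/>
  <url>
    <loc>http://example.com/changelist1.xml</loc>
    <lastmod>2013-02-01T23:00:00Z</lastmod>
  </url>
  <url>
    <loc>http://example.com/changelist2.xml</loc>
    <lastmod>2013-02-02T23:00:00Z</lastmod>
  </url>
  <url>
    <loc>http://example.com/changelist3.xml</loc>
    <lastmod>2013-02-03T23:00:00Z</lastmod>
  </url>
</urlset>
"""


def parse_utc(text):
    """Parse the UTC timestamp form used in the slides (YYYY-MM-DDThh:mm:ssZ)."""
    stamp = datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")
    return stamp.replace(tzinfo=timezone.utc)


def change_lists_since(archive_xml, last_sync):
    """Return Change List URIs modified after last_sync, oldest first."""
    root = ET.fromstring(archive_xml.strip())
    pending = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        if loc is None or lastmod is None:
            continue
        stamp = parse_utc(lastmod.strip())
        if stamp > last_sync:
            pending.append((stamp, loc.strip()))
    return [loc for _, loc in sorted(pending)]


if __name__ == "__main__":
    for uri in change_lists_since(ARCHIVE, parse_utc("2013-02-02T00:00:00Z")):
        print(uri)
```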
Change Dump Archive
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:md capability="changedump-archive"
from="2013-02-10T03:00:00Z"
until="2013-02-17T03:00:00Z"/>
<url>
<loc>http://example.com/changedump1.xml</loc>
<lastmod>2013-02-10T03:00:00Z</lastmod>
</url>
<url>
<loc>http://example.com/changedump2.xml</loc>
<lastmod>2013-02-17T03:00:00Z</lastmod>
</url>
<url>
…
</url>
</urlset>
ResourceSync - Agenda
5. Implementation
Implementation #1:
The Metadata Harvesting Use Case
The Metadata Harvesting Use Case
1. Identification of metadata records within a service
2. Use of standards in metadata formats
3. Incremental updates
4. Create, Update, Delete
5. Sets
The Metadata Harvesting Use Case
1. Identification of metadata records within a service
2. Use of standards in metadata formats
ResourceSync does not specifically care about metadata records, only
resources. It is up to the server to identify which of those resources
are metadata.
We are free to annotate a resource's entry with appropriate metadata
to indicate the format.
The Metadata Harvesting Use Case
3. Incremental updates
4. Create, Update, Delete
5. Sets
All resources that can be obtained from a change list will be annotated
with the kind of change that happened to them.
ResourceSync allows the server to publish lists of resources and
changes and indexes of those lists all annotated with metadata.
ResourceSync publishes changes as static documents. The client is
then free to walk up and down the change lists provided by the server.
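As an illustration of incremental harvesting, the sketch below applies the created, updated, and deleted entries of one Change List to a local record of what the client holds. The `mirror` dictionary (URI mapped to last-modified timestamp), the function name, and the sample entries are assumptions; a real client would also fetch and store the changed resources.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# Hypothetical Change List with one entry per kind of change.
CHANGE_LIST = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist"
         from="2013-01-02T09:00:00Z"
         until="2013-01-03T09:00:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md change="updated"/>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2013-01-02T14:00:00Z</lastmod>
    <rs:md change="deleted"/>
  </url>
  <url>
    <loc>http://example.com/res3</loc>
    <lastmod>2013-01-02T15:00:00Z</lastmod>
    <rs:md change="created"/>
  </url>
</urlset>
"""


def apply_change_list(change_list_xml, mirror):
    """Update mirror (URI -> lastmod) in document order and return it."""
    root = ET.fromstring(change_list_xml.strip())
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        lastmod = url.findtext("sm:lastmod", default="", namespaces=NS).strip()
        md = url.find("rs:md", NS)
        change = md.get("change") if md is not None else None
        if change in ("created", "updated"):
            mirror[loc] = lastmod
        elif change == "deleted":
            mirror.pop(loc, None)  # tolerate resources never mirrored
    return mirror


if __name__ == "__main__":
    state = {
        "http://example.com/res1": "2013-01-01T00:00:00Z",
        "http://example.com/res2": "2013-01-01T00:00:00Z",
    }
    print(apply_change_list(CHANGE_LIST, state))
```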
(Required) Documents for
metadata harvesting use case
Describing Metadata Resources
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:md capability="resourcelist"
from="2013-05-05T13:00:00Z"/>
<url>
<loc>http://mydspace.edu/dspace-rs/resource/123456789/7/qdc</loc>
<lastmod>2013-05-01T19:09:35Z</lastmod>
<changefreq>never</changefreq>
<rs:md type="application/xml"/>
<rs:ln href="http://mydspace.edu/bitstream/123456789/7/1/bitstream.pdf"
rel="describes"/>
<rs:ln href="http://mydspace.edu/bitstream/123456789/7/2/image.jpg"
rel="describes"/>
<rs:ln href="http://mydspace.edu/123456789/3"
rel="collection"/>
</url>
</urlset>
Describing Bitstream Resources
<urlset
…
<url>
<loc>http://mydspace.edu/bitstream/123456789/7/1/bitstream.pdf</loc>
<lastmod>2013-05-01T19:09:35Z</lastmod>
<changefreq>never</changefreq>
<rs:md hash="md5:75d0ea94097a05fce9aca5b079e2f209"
length="419805"
type="application/pdf"/>
<rs:ln href="http://mydspace.edu/dspace-rs/resource/123456789/7/qdc"
rel="describedby"/>
<rs:ln href="http://mydspace.edu/dspace-rs/resource/123456789/7/mets"
rel="describedby"/>
<rs:ln href="http://mydspace.edu/dspace-rs/resource/123456789/12/qdc"
rel="describedby"/>
<rs:ln href="http://mydspace.edu/123456789/2"
rel="collection"/>
</url>
</urlset>
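The `hash` and `length` attributes let a client check a downloaded bitstream. A minimal sketch, assuming the `hash` value is one or more space-separated `algorithm:hexdigest` tokens using the digest names listed in the metadata summary (md5, sha-1, sha-256); the function name `verify` is hypothetical.

```python
import hashlib

# Digest names as they appear in the hash attribute, mapped to hashlib names.
ALGORITHMS = {"md5": "md5", "sha-1": "sha1", "sha-256": "sha256"}


def verify(content, hash_attr=None, length_attr=None):
    """Check downloaded bytes against the hash and length attribute values."""
    if length_attr is not None and len(content) != int(length_attr):
        return False
    for token in (hash_attr or "").split():
        name, _, expected = token.partition(":")
        algorithm = ALGORITHMS.get(name.lower())
        if algorithm is None:
            continue  # unknown digest type: skip it rather than fail
        if hashlib.new(algorithm, content).hexdigest() != expected.lower():
            return False
    return True


if __name__ == "__main__":
    data = b"hello"
    digest = "md5:" + hashlib.md5(data).hexdigest()
    print(verify(data, digest, "5"))
```

Checking the length first is cheap and catches truncated downloads before any digest is computed.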
Serving Metadata Resources
http://mydspace.edu/dspace-rs/resource/123456789/7/qdc
URL parts: ResourceSync webapp (dspace-rs), Item handle (123456789/7), Metadata Format (qdc)
metadata.formats = 
qdc = http://purl.org/dc/terms/, 
mets = http://www.loc.gov/METS/
metadata.types = 
qdc = application/xml, 
mets = application/xml
<loc>http://mydspace.edu/dspace-rs/resource/123456789/7/qdc</loc>
<rs:md type="application/xml"/>
<rs:ln href="http://purl.org/dc/terms/"
rel="describedby"/>
<loc>http://mydspace.edu/dspace-rs/resource/123456789/7/mets</loc>
<rs:md type="application/xml"/>
<rs:ln href="http://www.loc.gov/METS/"
rel="describedby"/>
Generating Documents
1. Initialise
Creates initial Capability List and Resource List documents
[dspace]/bin/dspace dsrun org.dspace.resourcesync.ResourceSyncGenerator -i
2. Update
Creates a new Change List which covers the period since the last Change List
was created
[dspace]/bin/dspace dsrun org.dspace.resourcesync.ResourceSyncGenerator -u
3. Rebase
A combination of both Initialise and Update.
[dspace]/bin/dspace dsrun org.dspace.resourcesync.ResourceSyncGenerator -r
Usage of Resources by clients
Impact on DSpace
URLs
• Stable identifiers for archived items
• Stable identifiers for unarchived items
• Stable identifiers for metadata resources (in their various formats)
• Stable identifiers for previous versions
Provenance
• History of changes to an item/bitstream
• Item/bitstream deletions (vs withdraw)
• Bitstream create/update dates
• Item create/update dates
Versioning
• Access of previous versions of both metadata and bitstreams
• Stable identifiers for previous versions of both metadata and
bitstreams
Metadata Resources
• Metadata in a variety of formats
• Metadata as file/bitstream
Admin Files
• ResourceSync documents (Resource Lists, Change Lists, etc)
• ResourceSync exports - Resource Dumps, Change Dumps
• Metadata exports in a number of formats
Scheduled Tasks
• Regular generation of RS documents
Complex Objects
• Item/bitstream relationships
• Collections of content
Get the software!
DSpace module:
https://github.com/CottageLabs/DSpaceResourceSync
depends on the common Java library:
https://github.com/CottageLabs/ResourceSyncJava
PHP client:
https://github.com/stuartlewis/resync-php
depends on the SWORDv2 client library:
https://github.com/swordapp/swordappv2-php-library/
Implementation #2:
ResourceSync at arXiv.org
ResourceSync @ arXiv
• Use ResourceSync for both mirroring and public data access
o efficient updates
o ability to do periodic audits
o public synchronization capability
o reduce admin burden
• Likely start with metadata + source for mirroring use case (doing
experiments now)
• Open access use case also requires processed PDF
• Some concerns about likely use/load…
Alternate download location
• Likely want to separate machine accesses from human accesses to
preserve response time on main server
=> Use Mirrored Content part of spec
o <loc> specifies canonical URI
- e.g. http://arxiv.org/pdf/1306.1073v1.pdf
o <rs:ln rel="duplicate"> specifies preferred download location
- e.g. http://export.arxiv.org/pdf/1306.1073v1.pdf
<url>
<loc>http://arxiv.org/pdf/1306.1073v1.pdf</loc>
<lastmod>2013-06-06T00:57:12Z</lastmod>
<rs:md hash="md5:e08e0c4e4d7b0895120014f0aa09e7c4"
length="287714" type="application/pdf"/>
<rs:ln rel="duplicate"
pri="1"
href="http://export.arxiv.org/pdf/1306.1073v1.pdf"
modified="2013-06-06T02:00:59Z"/>
</url>
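A client honouring the mirrored-content links might choose its download location as sketched below. The selection rule is an assumption for illustration: among `rel="duplicate"` links the lowest `pri` value wins (following the priority convention of RFC 6249), and the `<loc>` URI is used when no duplicate is offered. The function name and the second sample entry are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# First entry modelled on the arXiv slide; the second entry is hypothetical.
RESOURCE_LIST = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <url>
    <loc>http://arxiv.org/pdf/1306.1073v1.pdf</loc>
    <lastmod>2013-06-06T00:57:12Z</lastmod>
    <rs:ln rel="duplicate" pri="1"
           href="http://export.arxiv.org/pdf/1306.1073v1.pdf"
           modified="2013-06-06T02:00:59Z"/>
  </url>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
  </url>
</urlset>
"""


def preferred_locations(xml_text):
    """Map each canonical <loc> URI to the URI a client should download from."""
    root = ET.fromstring(xml_text.strip())
    chosen = {}
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        duplicates = [
            ln for ln in url.findall("rs:ln", NS)
            if ln.get("rel") == "duplicate" and ln.get("href")
        ]
        # Lower pri means higher priority; links without pri sort last.
        duplicates.sort(key=lambda ln: int(ln.get("pri", "999999")))
        chosen[loc] = duplicates[0].get("href") if duplicates else loc
    return chosen


if __name__ == "__main__":
    for canonical, download in preferred_locations(RESOURCE_LIST).items():
        print(canonical, "->", download)
```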
Getting a copy of arXiv
It might be as easy as:
(of course, you probably have to wait a while but it is nice to know ResourceSync is
stateless so one can efficiently restart)
Python Library and Client
• Aim to provide library code implementing all ResourceSync
facilities for use in both source and destination implementations
o Designed for Python 2.6 (RHEL6) and 2.7
o Will not work with Python <= 2.5
• Client (resync) supports many destination operations, inspired
by the common Unix rsync program
• Client also supports some operations that might be useful in a
source, such as generation of static Resource Lists, or periodic
Change Lists (used in arXiv experiments)
• Explorer (resync-explorer) intended to allow easy inspection
of a source’s resource sets and capabilities
• Developed since ResourceSync v0.5, updated for v0.9
http://github.com/resync/resync
ResourceSync Source Simulator
• Python code using Tornado server
• Provides random set of resources of different sizes updated at a
particular rate
• Very useful for testing Destination code
http://github.com/resync/simulator
ResourceSync - Agenda
6. Q&A
Timeline
• June 2013
o Version 0.9 of ResourceSync framework specification released
o Soliciting broad feedback
• July 2013
o Version 0.x of Push-based methods for ResourceSync
• Fall 2013
o Specification becomes NISO standard
Pointers
• Specification
http://www.openarchives.org/rs/
http://www.openarchives.org/rs/0.9/resourcesync
http://www.openarchives.org/rs/0.9/archives
• List for public comment
https://groups.google.com/d/forum/resourcesync
• Client and simulator code
http://github.com/resync/resync
http://github.com/resync/simulator
ResourceSync:
A Web-Based
Resource Synchronization
Framework
ResourceSync is funded by
The Sloan Foundation & JISC
#resourcesync
OCFL v1.0
 
IIIF Technical Specification Status Update
IIIF Technical Specification Status UpdateIIIF Technical Specification Status Update
IIIF Technical Specification Status Update
 
LKG Editor Dev
LKG Editor DevLKG Editor Dev
LKG Editor Dev
 
Don't bold the field name!
Don't bold the field name!Don't bold the field name!
Don't bold the field name!
 
Samvera and IIIF 2018
Samvera and IIIF 2018Samvera and IIIF 2018
Samvera and IIIF 2018
 
Oxford Common File Layout (OCFL)
Oxford Common File Layout (OCFL)Oxford Common File Layout (OCFL)
Oxford Common File Layout (OCFL)
 
ORCID @ Cornell
ORCID @ CornellORCID @ Cornell
ORCID @ Cornell
 
From Open Annotations to W3C Web Annotations (and the impact on IIIF Present...
From Open Annotations to W3C Web Annotations (and the impact on IIIF Present...From Open Annotations to W3C Web Annotations (and the impact on IIIF Present...
From Open Annotations to W3C Web Annotations (and the impact on IIIF Present...
 
Introduction to the IIIF Presentation API (@SWIB17)
Introduction to the IIIF Presentation API (@SWIB17)Introduction to the IIIF Presentation API (@SWIB17)
Introduction to the IIIF Presentation API (@SWIB17)
 
Introduction to the International Image Interoperability Framework (IIIF)
Introduction to the International Image Interoperability Framework (IIIF)Introduction to the International Image Interoperability Framework (IIIF)
Introduction to the International Image Interoperability Framework (IIIF)
 
From Open Access to Open Standards, (Linked) Data and Collaborations
From Open Access to Open Standards, (Linked) Data and CollaborationsFrom Open Access to Open Standards, (Linked) Data and Collaborations
From Open Access to Open Standards, (Linked) Data and Collaborations
 
Mind the gap! Reflections on the state of repository data harvesting
Mind the gap! Reflections on the state of repository data harvestingMind the gap! Reflections on the state of repository data harvesting
Mind the gap! Reflections on the state of repository data harvesting
 
ORCID & other Person iDs
ORCID & other Person iDsORCID & other Person iDs
ORCID & other Person iDs
 
Who's the Author? Identifier soup - ORCID, ISNI, LC NACO and VIAF
Who's the Author? Identifier soup - ORCID, ISNI, LC NACO and VIAFWho's the Author? Identifier soup - ORCID, ISNI, LC NACO and VIAF
Who's the Author? Identifier soup - ORCID, ISNI, LC NACO and VIAF
 

Recently uploaded

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Recently uploaded (20)

AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 

ResourceSync Tutorial from Open Repositories 2013

  • 1. ResourceSync Tutorial July 8, 2013, Open Repositories 2013, PEI, Canada ResourceSync: A Web-Based Resource Synchronization Framework ResourceSync is funded by The Sloan Foundation & JISC #resourcesync 1
  • 2. ResourceSync Tutorial History
      Presenter: Simeon Warner, Cornell University, simeon.warner@cornell.edu, @zimeon
      • First outing: OAI8, Geneva, Switzerland, June 2013
      • Second run: Open Repositories, here and now
      • Most recent version of these tutorial slides is available at: http://www.slideshare.net/OpenArchivesInitiative/resourcesync-tutorial
  • 3. ResourceSync Tutorial Contributors
      • Martin Klein, Los Alamos National Laboratory, <martinklein0815@gmail.com>, @mart1nkle1n
      • Simeon Warner, Cornell University, simeon.warner@cornell.edu, @zimeon
      • Herbert Van de Sompel, Los Alamos National Laboratory, <hvdsomp@gmail.com>, @hvdsomp
      • Robert Sanderson, Los Alamos National Laboratory, <azaroth24@gmail.com>, @azaroth24
      • Richard Jones, Cottage Labs, <richard@cottagelabs.com>, @cottagelabs
  • 4. ResourceSync Core Team
      • OAI: Herbert Van de Sompel, Martin Klein, Robert Sanderson (Los Alamos National Laboratory); Simeon Warner (Cornell University); Bernhard Haslhofer (University of Vienna); Michael L. Nelson (Old Dominion University); Carl Lagoze (University of Michigan)
      • NISO: Todd Carpenter, Nettie Lagace
      • Lyrasis: Peter Murray
  • 5. ResourceSync Technical Group
      • JISC: Richard Jones, Graham Klyne, Stuart Lewis
      • OCLC: Jeff Young
      • LOCKSS: David Rosenthal
      • RedHat: Christian Sadilek
      • Ex Libris Inc.: Shlomo Sanders
      • Library of Congress: Kevin Ford
  • 6. ResourceSync - Agenda
      1. ResourceSync: Problem Perspective & Conceptual Approach
      2. Motivation & Use Cases
      3. Framework Walkthrough
      4. Framework (Technical) Details
      5. Implementation
      6. Q&A
  • 7. ResourceSync - Agenda
      1. ResourceSync: Problem Perspective & Conceptual Approach
  • 8. Synchronize What?
      • Web resources
          o things with a URI that can be dereferenced
      • Focus on needs of research communication and cultural heritage organizations
          o but aim for generality
  • 9. Synchronize What?
      • Small websites/repositories (a few resources) to large repositories/datasets/linked data collections (many millions of resources)
  • 10. Synchronize What?
      • Low change frequency (weeks/months) to high change frequency (seconds)
  • 11. Synchronize What?
      • Synchronization latency and accuracy needs may vary
  • 12. Why?
      … because lots of projects and services are doing synchronization but have to resort to ad hoc, case-by-case approaches!
      • Project team involved with projects that need this
      • Experience with OAI-PMH: widely used in repositories, but
          o XML metadata only
          o Attempts at synchronizing actual content via OAI-PMH (complex object formats, dc:identifier) were not successful
          o Web technology has moved on since 1999
      • Devise a shared solution for data, metadata, linked data?
  • 13. ResourceSync Problem
      • Consideration:
          o Source (server) A has resources that change over time: they get created, modified, deleted
          o Destinations (servers) X, Y, and Z leverage (some) resources of Source A
      • Problem:
          o Destinations want to keep in step with the resource changes at Source A: resource synchronization
      • Goal:
          o Design an approach for resource synchronization aligned with the Web Architecture that has a fair chance of adoption by different communities
          o The approach must scale better than recurrent HTTP HEAD/GET on resources
  • 14. Source: Four Core Synchronization Capabilities
      1. Describing content – publish a list of resources available for synchronization to enable Destinations to perform an initial load or catch up with a Source
      2. Packaging content – bundle resources to enable bulk download for Destinations
      3. Describing changes – publish a list of resource changes to enable Destinations to stay synchronized and decrease latency
      4. Packaging changes – bundle resource changes for bulk download for Destinations
  • 15. Source: Synchronization Features
      5. Linking to related resources – provide links from resources to be synchronized to related resources; applicable to all core capabilities (1..4)
      6. Access to historical data – provide archives of 1..4
      7. Discovery of capabilities – support Destinations in discovering all offered capabilities 1..4
  • 16. Destination: Synchronization Needs
      1. Baseline synchronization – A destination must be able to perform an initial load or catch up with a source
          - avoid out-of-band setup
      2. Incremental synchronization – A destination must have some way to keep up to date with changes at a source
          - subject to some latency; minimal: create/update/delete
          - allow catching up after the destination has been offline
      3. Audit – A destination should be able to determine whether it is synchronized with a source
          - subject to some latency
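
The audit need on slide 16 can be illustrated with a minimal Python sketch. It assumes, hypothetically, that the state published by the Source and the Destination's local copy are both available as dictionaries mapping a resource URI to a content hash. The function name `audit` and the sample data are illustrative and not part of ResourceSync.

```python
def audit(source_state, local_state):
    """Compare two {uri: hash} mappings.

    Returns three sorted lists:
      missing - URIs the source lists but the destination lacks
      stale   - URIs present on both sides with differing hashes
      extra   - URIs the destination holds but the source no longer lists
    """
    missing = sorted(u for u in source_state if u not in local_state)
    stale = sorted(
        u for u in source_state
        if u in local_state and local_state[u] != source_state[u]
    )
    extra = sorted(u for u in local_state if u not in source_state)
    return missing, stale, extra


# Illustrative data only (hash values are made up).
source_state = {
    "http://example.com/res1": "md5:aaa",
    "http://example.com/res2": "md5:bbb",
    "http://example.com/res3": "md5:ccc",
}
local_state = {
    "http://example.com/res1": "md5:aaa",   # in sync
    "http://example.com/res2": "md5:old",   # out of date
    "http://example.com/res9": "md5:zzz",   # no longer at the source
}

missing, stale, extra = audit(source_state, local_state)
in_sync = not (missing or stale or extra)
```

A Destination is synchronized, subject to latency, when all three lists are empty.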
  • 17. ResourceSync - Agenda
      2. Motivation & Use Cases
  • 18. Use Cases – The Basics (figure panels a, b)
  • 19. Use Cases – The Basics (figure panels c, d)
  • 20. Use Cases – The not-so-Basics (figure panels e, f)
  • 21. Use Cases – The not-so-Basics (figure panels g, h)
  • 22. Use Case 1: arXiv Mirroring and Data Sharing
      • Repository of scholarly articles in physics, mathematics, computer science, etc.
      • > 850k articles
      • approx. 1.5 revisions per article on average
      • approx. 75k new articles per year
      • Each article has full-text and separate metadata record
      • approx. 3.8M resources
  • 23. Use Case 1: arXiv Mirroring and Data Sharing
      • 2,700 updates daily
          o at 8pm EST
          o Currently using homebrew mirroring solution (running with minor modifications since 1994!)
          o occasional rsync (file system-specific, auth issues)
  • 24. Mirroring arXiv: 1994 - 2013
      • Operated since the very early days of the Web!
          1. HTTP trigger from the main site
          2. HTTP pull update specific to mirror site
          3. HTTP download of the resources
          4. HTTP trigger to main site when mirror process complete
          5. HTTP verification (via HEAD) by the main site, which updates the update list specific to mirror site
          6. periodic repeat as long as there are updates in the inventory for that mirror
      • Requires trusted set of servers operating with the same internal organization
      • Does not support synchronization check (so rsync is used periodically)
  • 25. Use Case 1: arXiv Mirroring
      • GOAL: Keep mirror sites synchronized with daily changes
      • WANT:
          o high consistency
          o moderate latency
          o robustness to global network outages (low admin effort)
          o ability to verify sync status in case of questions
  • 26. Use Case 1: arXiv Data Sharing
      • GOAL: Make resources and update information publicly available so that any other service may synchronize at the frequency it needs, e.g.
          o Math Front at UC Davis
          o EprintWeb from IOP in UK
          o Data for bibliometric and scientometric analysis
      • WANT:
          o low admin effort (i.e. standard approach, standard tools)
          o reasonable consistency, latency, efficiency
  • 27. Use Case 2: DBpedia Live Duplication
      • Average of 2 updates per second
      • Low latency desirable => need for a push technology
  • 28. Use Case 2: DBpedia Live Duplication
      • Initial experiment with distributed infrastructure
  • 29. Use Case 2: DBpedia Live Duplication
      • Daily traffic:
          o 99% updates
          o 0.6% deletions
          o 0.03% creations
  • 30. Use Case 2: DBpedia Live Duplication
      • Number of content transfer events in two 8-hour intervals
      • Max. queue size of remote duplication process
  • 31. ResourceSync - Agenda
      3. Framework Walkthrough
  • 32. Source Capability 1: Describing Content
      In order to advertise the resources that a source wants destinations to know about, it may describe them:
          o Publish a Resource List, a list of resource URIs and possibly associated metadata
              - Destination GETs the Content Description
              - Destination GETs listed resources by their URI
          o Describes state of set of resources at one point in time (snapshot)
  • 35. Source Capability 2: Packaging Content
      By default, content is transferred in response to a GET issued by a destination against a URI of a source’s resource. But a source may support additional mechanisms:
          o Publish a Resource Dump, a document that points to packages of resource representations and necessary metadata
              - Destination GETs the package
              - Destination unpacks the package
              - ZIP format supported
          o Packages set of resources at one point in time (snapshot)
  • 38. Source: Modular Capabilities (figure)
  • 39. Source Capability 3: Describing Changes
      In order to achieve lower latency and/or greater efficiency, a source may communicate about changes to its resources:
          o Publish a Change List, a list of recent change events (created, updated, deleted resource)
              - Destination acts upon change events, e.g. GETs created/updated resources, removes deleted resources
          o Describes changes to resources that occurred in a temporal interval with a start- and an end-date
  • 44. Source Capability 4: Packaging Changes
      In order to reduce the number of requests to obtain resource changes, a source may provide packaged bitstreams for changed resources:
          o Publish a Change Dump, a document that points to packages of recently changed resource representations and necessary metadata
              - Destination GETs the package
              - Destination unpacks the package
              - ZIP format supported
          o Packages resources that changed in a temporal interval with a start- and an end-date
  • 47. Source: Modular Capabilities (figure)
  • 48. Framework Structure (light) (figure)
  • 49. Framework Structure (complete) (figure)
  • 50. Destination: Key Processes (figure)
  • 51. ResourceSync - Agenda
      4. Framework (Technical) Details
  • 52. ResourceSync - Agenda
      4. Framework (Technical) Details
          1. Sitemaps
          2. Pull method
          3. Linking between resources
          4. Discovery
          5. Push method
          6. Archives
  • 53. ResourceSync - Agenda
      4. Framework (Technical) Details
          1. Sitemaps
  • 54. So Many Choices (figure labels): XMPP, AtomPub, SDShare, RSS, Atom, PubSubHubbub, Sitemap, XMPP, rsync, OAI-PMH, WebDAV Col. Syn., OAI-ORE, DSNotify, RDFsync, Crawl, Push, Pull, SWORD, SPARQLpush
  • 55. So Many Choices (same figure labels as slide 54)
  • 57. A Framework Based on Sitemaps
      • Modular framework allowing selective deployment
      • Sitemap is the core format throughout the framework
          o Introduce extension elements and attributes:
              - In ResourceSync namespace (rs:) to accommodate synchronization needs
          o Reuse Sitemap format for all capability documents: Resource List, Resource Dump, Change List, Change Dump, as well as for manifest in Dumps
          o Utilize Sitemap index format where needed/allowed
  • 58. Sitemap Format
      <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
        <url>
          <loc>http://example.com/res1</loc>
          <lastmod>2013-01-02T13:00:00Z</lastmod>
        </url>
        <url>
          <loc>http://example.com/res2</loc>
          <lastmod>2013-01-02T14:00:00Z</lastmod>
        </url>
        …
      </urlset>
  • 59. Sitemap Index Format
      <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
        <sitemap>
          <loc>http://example.com/sitemap1.xml</loc>
          <lastmod>2013-01-02T13:00:00Z</lastmod>
        </sitemap>
        <sitemap>
          <loc>http://example.com/sitemap2.xml</loc>
          <lastmod>2013-01-02T14:00:00Z</lastmod>
        </sitemap>
        …
      </sitemapindex>
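
As a rough illustration of how a Destination might read the two plain Sitemap formats on slides 58 and 59, the following Python sketch parses either document with the standard library. The function name `parse_sitemap` is hypothetical; only the namespace URI and element names come from the slides.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def parse_sitemap(xml_text):
    """Parse a <urlset> or <sitemapindex> document.

    Returns (kind, entries): kind is the root element name and
    entries is a list of (loc, lastmod) tuples in document order.
    """
    root = ET.fromstring(xml_text)
    kind = root.tag.split("}")[-1]            # drop the "{namespace}" prefix
    child = "url" if kind == "urlset" else "sitemap"
    entries = []
    for node in root.findall(f"{{{SITEMAP_NS}}}{child}"):
        loc = node.findtext(f"{{{SITEMAP_NS}}}loc")
        lastmod = node.findtext(f"{{{SITEMAP_NS}}}lastmod")
        entries.append((loc, lastmod))
    return kind, entries


EXAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2013-01-02T14:00:00Z</lastmod>
  </url>
</urlset>"""

kind, entries = parse_sitemap(EXAMPLE)
```

The same function handles a Sitemap index, since only the root and child element names differ.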
  • 60. ResourceSync Sitemap Extensions
      <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
        <rs:ln …/>
        <rs:md …/>
        <url>
          <loc>http://example.com/res1</loc>
          <lastmod>2013-01-02T13:00:00Z</lastmod>
          <rs:ln …/>
          <rs:md …/>
        </url>
        <url> … </url>
      </urlset>
  • 61. ResourceSync Sitemap Extensions
      <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
                    xmlns:rs="http://www.openarchives.org/rs/terms/">
        <rs:ln …/>
        <rs:md …/>
        <sitemap>
          <loc>http://example.com/sitemap1.xml</loc>
          <lastmod>2013-01-02T13:00:00Z</lastmod>
          <rs:ln …/>
          <rs:md …/>
        </sitemap>
        …
      </sitemapindex>
  • 62. ResourceSync - Agenda
      4. Framework (Technical) Details
          2. Pull method
  • 63. Capability 1: Resource List (figure)
  • 64. Capability 1: Resource List
      <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
        <rs:md capability="resourcelist" from="2013-01-03T09:00:00Z"/>
        <url>
          <loc>http://example.com/res1</loc>
          <lastmod>2013-01-02T13:00:00Z</lastmod>
          <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6" length="8876" type="text/html"/>
        </url>
        <url> … </url>
      </urlset>
  • 65. Resource List
      • Describe Source’s resources that are subject to synchronization
      • At one point in time (snapshot)
      • Typical Destination use: Baseline Synchronization, Audit
      • Each URI typically listed only once
      • Might be expensive to generate
      • Destinations use @from to determine freshness
      • Issue GETs against URIs to obtain resources
      • Very similar to current Sitemaps
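
A minimal sketch of how a Destination might use a Resource List: read the document-level `from` attribute, collect the per-resource `rs:md` metadata, and check a downloaded representation against the advertised `length` and `md5` hash. The function names and the sample body are assumptions for illustration; the element and attribute names follow slide 64. Only the `md5:` hash prefix shown on the slide is handled.

```python
import hashlib
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}


def parse_resource_list(xml_text):
    """Return (snapshot_time, {uri: rs:md attributes}) for a Resource List."""
    root = ET.fromstring(xml_text)
    doc_md = root.find("rs:md", NS)                  # document-level metadata
    snapshot_time = doc_md.get("from") if doc_md is not None else None
    resources = {}
    for url in root.findall("sm:url", NS):
        md = url.find("rs:md", NS)
        uri = url.findtext("sm:loc", namespaces=NS)
        resources[uri] = dict(md.attrib) if md is not None else {}
    return snapshot_time, resources


def matches(content, md):
    """Check downloaded bytes against the advertised length and md5 hash."""
    if "length" in md and len(content) != int(md["length"]):
        return False
    if "hash" in md:
        algo, _, expected = md["hash"].partition(":")
        if algo == "md5" and hashlib.md5(content).hexdigest() != expected:
            return False
    return True


# Hypothetical resource body; its hash and length are computed here so the
# example document is internally consistent.
BODY = b"<html>hello</html>"
EXAMPLE = f"""<urlset xmlns="{NS['sm']}" xmlns:rs="{NS['rs']}">
  <rs:md capability="resourcelist" from="2013-01-03T09:00:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md hash="md5:{hashlib.md5(BODY).hexdigest()}" length="{len(BODY)}" type="text/html"/>
  </url>
</urlset>"""

snapshot_time, resources = parse_resource_list(EXAMPLE)
ok = matches(BODY, resources["http://example.com/res1"])
```

In a real Destination the bytes passed to `matches` would come from an HTTP GET against each listed URI; the network step is left out to keep the sketch self-contained.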
  • 66. What if I have a million resources?
      • Current sitemap limit is 50k resources (or maximum document size of 50MB)
      • Break complete list of resources into 50k-resource chunks, each in a Resource List document
      • Create a Resource List Index document to group them:
          o Based on <sitemapindex>
          o May have up to 50k component Resource Lists
          o Extends capacity to 2,500,000,000 resources within current community practices
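
The arithmetic on slide 66 can be checked with a short Python sketch: one million URIs split into 50k-entry chunks need 20 Resource List documents, and an index of up to 50k such lists reaches 50,000 × 50,000 = 2,500,000,000 resources. The helper name `chunk` and the generated URIs are illustrative only.

```python
MAX_ENTRIES = 50_000   # per-document limit quoted on the slide


def chunk(uris, size=MAX_ENTRIES):
    """Split a list of URIs into consecutive chunks of at most `size` items."""
    return [uris[i:i + size] for i in range(0, len(uris), size)]


# One million hypothetical resource URIs.
uris = [f"http://example.com/res{i}" for i in range(1_000_000)]
chunks = chunk(uris)

# Each chunk would become one Resource List document; the index lists them.
component_docs = [f"resourcelist{n}.xml" for n in range(1, len(chunks) + 1)]

# Upper bound with one index level: 50k lists of 50k resources each.
capacity = MAX_ENTRIES * MAX_ENTRIES
```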
  • 67. Resource List Index (resourcelist_index.xml)
      <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
                    xmlns:rs="http://www.openarchives.org/rs/terms/">
        <rs:md capability="resourcelist" from="2013-01-02T09:00:02Z"/>
        <sitemap>
          <loc>http://example.com/resourcelist1.xml</loc>
          <lastmod>2013-01-02T11:00:00Z</lastmod>
          <rs:md type="application/xml"/>
        </sitemap>
        <sitemap>
          <loc>http://example.com/resourcelist2.xml</loc>
          <lastmod>2013-01-02T11:00:01Z</lastmod>
          <rs:md type="application/xml"/>
        </sitemap>
      </sitemapindex>
  • 68. Resource List (resourcelist1.xml)
      <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
        <rs:ln rel="up" href="http://example.com/resourcelist_index.xml"/>
        <rs:md capability="resourcelist" from="2013-01-02T09:00:00Z"/>
        <url>
          <loc>http://example.com/res1</loc>
          <lastmod>2013-01-02T08:07:06Z</lastmod>
          <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6" length="8876" type="text/html"/>
        </url>
        ...
      </urlset>
  • 69. Capability 2: Resource Dump (figure)
  • 70. Capability 2: Resource Dump
      <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
        <rs:md capability="resourcedump" from="2013-01-02T09:00:00Z"/>
        <url>
          <loc>http://example.com/resourcedump_part1.zip</loc>
          <lastmod>2013-01-02T13:00:00Z</lastmod>
          <rs:md length="97553" type="application/zip"/>
        </url>
        <url>
          <loc>http://example.com/resourcedump_part2.zip</loc>
          <lastmod>2013-01-02T13:00:00Z</lastmod>
          <rs:md length="21294" type="application/zip"/>
        </url>
      </urlset>
  • 71. Resource Dump Manifest
      <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
        <rs:md capability="resourcedump-manifest" from="2013-01-02T09:00:00Z"/>
        <url>
          <loc>http://example.com/res1</loc>
          <lastmod>2013-01-02T13:00:00Z</lastmod>
          <rs:md type="text/html" path="/resources/res1"/>
        </url>
        <url>
          <loc>http://example.com/res2</loc>
          <lastmod>2013-01-02T13:00:00Z</lastmod>
          <rs:md type="application/pdf" path="/resources/res2"/>
        </url>
      </urlset>
  • 72. Resource Dump
      • Package Source’s resources that are subject to synchronization
      • At one point in time (snapshot)
      • Points to ZIP packages
      • Mandatory, even for only one ZIP
      • ZIP package contains manifest, listing contained bitstreams
      • Typical Destination use: Baseline Synchronization, bulk download
      • Each URI typically listed only once
      • Might be expensive to generate
      • Destinations use @from to determine freshness
      • GETs against individual URIs from Resource List achieve the same result (ignoring varying freshness)
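
The unpacking step for a dump package can be sketched in Python as follows. The sketch builds a tiny ZIP in memory and then uses the manifest's `path` attributes to map each bitstream back to its resource URI. The manifest file name `manifest.xml` is an assumption (the slides only say the package contains a manifest), and `unpack_dump` and the sample contents are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}
MANIFEST_NAME = "manifest.xml"   # assumed name, not stated on the slides

MANIFEST = f"""<urlset xmlns="{NS['sm']}" xmlns:rs="{NS['rs']}">
  <rs:md capability="resourcedump-manifest" from="2013-01-02T09:00:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md type="text/html" path="/resources/res1"/>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md type="application/pdf" path="/resources/res2"/>
  </url>
</urlset>"""


def unpack_dump(zip_bytes):
    """Return {resource URI: content bytes} for one dump package."""
    result = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        manifest = ET.fromstring(zf.read(MANIFEST_NAME))
        for url in manifest.findall("sm:url", NS):
            uri = url.findtext("sm:loc", namespaces=NS)
            path = url.find("rs:md", NS).get("path")
            # ZIP member names carry no leading slash.
            result[uri] = zf.read(path.lstrip("/"))
    return result


# Build a small package in memory to stand in for a downloaded ZIP.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr(MANIFEST_NAME, MANIFEST)
    zf.writestr("resources/res1", b"<html>one</html>")
    zf.writestr("resources/res2", b"%PDF-sample")

unpacked = unpack_dump(buffer.getvalue())
```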
  • 73. Capability 3: Change List (figure)
  • 74. Capability 3: Change List
      <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
        <rs:md capability="changelist" from="2013-01-02T09:00:00Z" until="2013-01-03T09:00:00Z"/>
        <url>
          <loc>http://example.com/res1</loc>
          <lastmod>2013-01-02T13:00:00Z</lastmod>
          <rs:md change="updated" hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6" length="8876" type="text/html"/>
        </url>
        <url> … </url>
      </urlset>
  • 75. Change List
      • Describe Source’s resource changes
      • Occurring during temporal interval with start- and end-date
      • Typical Destination use: Incremental Synchronization, Audit
      • Changes are listed in chronological order
      • Multiple changes to one URI may result in multiple listings of the same URI
      • Source determines duration of temporal interval
      • Destinations use @from and @until to determine freshness
      • Issue GETs against URIs to obtain changed resources
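
A hedged sketch of incremental synchronization driven by a Change List: events are read in document order (which the slides say is chronological) and applied to a local copy. The local store is modeled as a plain dictionary and `fetch` is a stand-in for an HTTP GET; both are assumptions made to keep the example self-contained.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}


def parse_change_list(xml_text):
    """Return [(uri, change)] in document (chronological) order."""
    root = ET.fromstring(xml_text)
    events = []
    for url in root.findall("sm:url", NS):
        uri = url.findtext("sm:loc", namespaces=NS)
        change = url.find("rs:md", NS).get("change")
        events.append((uri, change))
    return events


def apply_changes(local, events, fetch):
    """Apply created/updated/deleted events to the {uri: bytes} mapping."""
    for uri, change in events:
        if change in ("created", "updated"):
            local[uri] = fetch(uri)          # GET the new representation
        elif change == "deleted":
            local.pop(uri, None)             # remove if present
        else:
            raise ValueError(f"unknown change type: {change!r}")
    return local


EXAMPLE = f"""<urlset xmlns="{NS['sm']}" xmlns:rs="{NS['rs']}">
  <rs:md capability="changelist" from="2013-01-02T09:00:00Z" until="2013-01-03T09:00:00Z"/>
  <url><loc>http://example.com/res1</loc><rs:md change="updated"/></url>
  <url><loc>http://example.com/res2</loc><rs:md change="created"/></url>
  <url><loc>http://example.com/res3</loc><rs:md change="deleted"/></url>
  <url><loc>http://example.com/res2</loc><rs:md change="updated"/></url>
</urlset>"""


def fake_fetch(uri):
    """Stand-in for an HTTP GET; returns bytes derived from the URI."""
    return b"content of " + uri.encode()


local = {
    "http://example.com/res1": b"old",
    "http://example.com/res3": b"to be removed",
}
events = parse_change_list(EXAMPLE)
apply_changes(local, events, fake_fetch)
```

Because the same URI can appear more than once, a Destination could also collapse the events to the last change per URI before fetching; the sketch applies them one by one for clarity.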
  • 76. ResourceSync Tutorial July 8, 2013, Open Repositories 2013, PEI, Canada Capability 4: Change Dump 76
  • 77. ResourceSync Tutorial July 8, 2013, Open Repositories 2013, PEI, Canada Capability 4: Change Dump <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:rs="http://www.openarchives.org/rs/terms/"> <rs:md capability=”changedump" from="2013-01-02T09:00:00Z” until="2013-01-03T09:00:00Z”/> <url> <loc>http://example.com/change_dump_part1.zip</loc> <lastmod>2013-01-02T13:00:00Z</lastmod> <rs:md length="887" type=”application/zip"/> </url> <url> <loc>http://example.com/change_dump_part2.zip</loc> <lastmod>2013-01-02T13:00:00Z</lastmod> <rs:md length=”9767" type=”application/zip"/> </url></urlset> 77
  • 78. Change Dump Manifest
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
            xmlns:rs="http://www.openarchives.org/rs/terms/">
      <rs:md capability="changedump-manifest"
             from="2013-01-02T09:00:00Z"
             until="2013-01-03T09:00:00Z"/>
      <url>
        <loc>http://example.com/res1</loc>
        <lastmod>2013-01-02T13:00:00Z</lastmod>
        <rs:md change="updated"
               length="2887"
               type="text/html"
               path="changes/res1"/>
      </url>
      <url> … </url>
    </urlset>
  • 79. Change Dump
    • Packages the Source’s resources that have changed during a temporal interval with a start and an end date
    • Points to ZIP packages
    • Mandatory, even for only one ZIP
    • Each ZIP package contains a manifest listing the contained bitstreams
    • Typical Destination use: Incremental Synchronization, bulk download of changes
    • Changes in the Change Dump Manifest are listed in chronological order
    • The same URI can be listed multiple times
    • Might be expensive to generate
    • Destinations use @from and @until to determine freshness
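Unpacking a Change Dump ZIP amounts to reading the manifest and using each entry's `path` attribute to locate the bitstream inside the package. The sketch below builds a tiny ZIP in memory and reads it back. The manifest file name `manifest.xml`, the demo content and both helper names are assumptions made for illustration; the slides only say that the ZIP contains a manifest.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
      "rs": "http://www.openarchives.org/rs/terms/"}

MANIFEST = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changedump-manifest"
         from="2013-01-02T09:00:00Z" until="2013-01-03T09:00:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md change="updated" type="text/html" path="changes/res1"/>
  </url>
</urlset>"""


def build_demo_zip():
    """Create a small in-memory ZIP standing in for a downloaded package."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("manifest.xml", MANIFEST)  # assumed manifest name
        zf.writestr("changes/res1", b"<html>new</html>")
    return buf.getvalue()


def read_change_dump(zip_bytes, manifest_name="manifest.xml"):
    """Return a list of (uri, change, content) in manifest order."""
    entries = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        root = ET.fromstring(zf.read(manifest_name))
        for url in root.findall("sm:url", NS):
            md = url.find("rs:md", NS)
            path = md.get("path")
            # Entries without a path carry no bitstream in the package.
            content = zf.read(path) if path else None
            entries.append((url.findtext("sm:loc", namespaces=NS),
                            md.get("change"), content))
    return entries
```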
  • 80. Recall… *Index <changelist_index.xml> <changelist1.xml>
  • 81. Change List Index <changelist_index.xml>
    <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
                  xmlns:rs="http://www.openarchives.org/rs/terms/">
      <rs:md capability="changelist"
             from="2013-01-02T09:00:00Z"
             until="2013-01-03T09:00:00Z"/>
      <sitemap>
        <loc>http://example.com/changelist1.xml</loc>
        <lastmod>2013-01-02T11:00:00Z</lastmod>
        <rs:md type="application/xml"/>
      </sitemap>
      <sitemap>
        <loc>http://example.com/changelist2.xml</loc>
        <lastmod>2013-01-02T23:00:00Z</lastmod>
        <rs:md type="application/xml"/>
      </sitemap>
    </sitemapindex>
  • 82. Change List <changelist1.xml>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
            xmlns:rs="http://www.openarchives.org/rs/terms/">
      <rs:ln rel="up" href="http://example.com/changelist_index.xml"/>
      <rs:md capability="changelist"
             from="2013-01-02T09:00:00Z"
             until="2013-01-02T21:00:00Z"/>
      <url>
        <loc>http://example.com/res1</loc>
        <lastmod>2013-01-02T13:00:00Z</lastmod>
        <rs:md change="updated"
               hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
               length="8876"
               type="text/html"/>
      </url>
    </urlset>
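A Destination that is handed a Change List Index has to visit each child Change List in turn. One possible way to do that is sketched below: children are ordered by `<lastmod>` and their entries are emitted as one stream. The `fetch` callable (any function mapping a URI to XML text), the demo documents and the `iter_changes` name are hypothetical; ordering children by `<lastmod>` is also an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
      "rs": "http://www.openarchives.org/rs/terms/"}

XMLNS = ('xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" '
         'xmlns:rs="http://www.openarchives.org/rs/terms/"')

INDEX = ("<sitemapindex %s>"
         "<sitemap><loc>http://example.com/changelist2.xml</loc>"
         "<lastmod>2013-01-02T23:00:00Z</lastmod></sitemap>"
         "<sitemap><loc>http://example.com/changelist1.xml</loc>"
         "<lastmod>2013-01-02T11:00:00Z</lastmod></sitemap>"
         "</sitemapindex>" % XMLNS)

# Stand-in for HTTP: maps each child URI to its document.
DOCS = {
    "http://example.com/changelist1.xml":
        "<urlset %s><url><loc>http://example.com/res1</loc>"
        '<rs:md change="created"/></url></urlset>' % XMLNS,
    "http://example.com/changelist2.xml":
        "<urlset %s><url><loc>http://example.com/res1</loc>"
        '<rs:md change="updated"/></url></urlset>' % XMLNS,
}


def iter_changes(index_xml, fetch):
    """Yield (uri, change) from every child Change List, oldest child first."""
    index = ET.fromstring(index_xml)
    children = sorted(
        (s.findtext("sm:lastmod", default="", namespaces=NS),
         s.findtext("sm:loc", namespaces=NS))
        for s in index.findall("sm:sitemap", NS))
    for _, loc in children:
        child = ET.fromstring(fetch(loc))
        for url in child.findall("sm:url", NS):
            md = url.find("rs:md", NS)
            yield (url.findtext("sm:loc", namespaces=NS),
                   md.get("change") if md is not None else None)
```

In a real client `fetch` would perform an HTTP GET; here `DOCS.__getitem__` can be passed in its place.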
  • 83. Resource Metadata Summary
    Element/Attribute | Description                                            | Defined by
    <loc>             | Resource URI (identity)                                | sitemaps
    <lastmod>         | Timestamp of last change                               | sitemaps
    <changefreq>      | Expected update frequency                              | sitemaps
    <rs:md>           |                                                        | ResourceSync
      change          | Change type (Change List & Change Dump Manifest only)  | ResourceSync
      encoding        | HTTP Content-Encoding header value                     | RFC2616
      hash            | One or more content digests (md5, sha-1, sha-256)      | Atom Link Ext.
      length          | HTTP Content-Length header value                       | RFC4287
      path            | Path in ZIP package (Dump Manifests only)              | ResourceSync
      type            | HTTP Content-Type header value                         | RFC4287
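The `hash` and `length` attributes let a Destination check a downloaded representation against what the Source advertised. A minimal sketch follows; it assumes that multiple digests in `hash` are separated by whitespace and written as `name:hexdigest`, and the `verify` helper is an illustrative name rather than part of any ResourceSync library.

```python
import hashlib

# Digest names from the summary table mapped to hashlib constructors.
ALGORITHMS = {"md5": hashlib.md5,
              "sha-1": hashlib.sha1,
              "sha-256": hashlib.sha256}


def verify(content, hash_attr=None, length_attr=None):
    """Return True if content matches the advertised length and digests."""
    if length_attr is not None and len(content) != int(length_attr):
        return False
    for token in (hash_attr or "").split():
        name, _, hexdigest = token.partition(":")
        algorithm = ALGORITHMS.get(name)
        if algorithm is None:
            continue  # unknown digest type: skip rather than fail
        if algorithm(content).hexdigest() != hexdigest.lower():
            return False
    return True
```

For the entry shown earlier, a call would look like `verify(body, "md5:1584abdf8ebdc9802ac0c6a7402c03b6", "8876")`, where `body` is the bytes returned by the GET.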
  • 84. ResourceSync - Agenda
    4. Framework (Technical) Details
       3. Linking between resources
  • 85. Supported Linking Use Cases
    The web is based on links between resources, many of which are important to understand for synchronization.
    1. Mirrored content with multiple download locations
    2. Alternate representations of the same content
    3. Patching content rather than replacing
    4. Resources and their metadata
    5. Prior versions of resources
    6. Collection membership of resources
    7. Republishing synchronized resources
    All cases are handled with an <rs:ln> element referring to the remote resource.
  • 86. Notes about Linked Resources
    Some important things to keep in mind about linked resources:
    • They may also be subject to synchronization
    • They may be updated on a very different schedule from the resources that link to them
    • Therefore, it is recommended to convey metadata about the linked resource too
    • Links can be bi-directional: the linked resource can link back to the linking resource
  • 87. Linking #1 - Mirror
    1. Mirrored content with multiple download locations
    This might occur due to:
    • Content distribution networks
    • Mirror sites
    • Backup locations
    • Load balancing
  • 88. Linking #1 - Mirror
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
            xmlns:rs="http://www.openarchives.org/rs/terms/">
      <rs:md capability="changelist"
             from="2013-01-02T09:00:00Z"
             until="2013-01-03T09:00:00Z"/>
      <url>
        <loc>http://example.com/res1</loc>
        <lastmod>2013-01-02T13:00:00Z</lastmod>
        <rs:md change="updated"/>
        <rs:ln rel="duplicate" pri="1" href="http://mirror1.example.com/res1"/>
        <rs:ln rel="duplicate" pri="2" href="http://mirror2.example.com/res1"/>
      </url>
    </urlset>
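A Destination can turn the `duplicate` links into an ordered list of download locations. The sketch below assumes that a lower `pri` value means a higher priority (as in RFC 6249) and that the `<loc>` URI is tried last; the fallback order and the `download_candidates` helper are illustrative choices, not something the slides prescribe.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
      "rs": "http://www.openarchives.org/rs/terms/"}

ENTRY = """<url xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
     xmlns:rs="http://www.openarchives.org/rs/terms/">
  <loc>http://example.com/res1</loc>
  <rs:md change="updated"/>
  <rs:ln rel="duplicate" pri="2" href="http://mirror2.example.com/res1"/>
  <rs:ln rel="duplicate" pri="1" href="http://mirror1.example.com/res1"/>
</url>"""


def download_candidates(url_elem):
    """Return URIs to try in order: mirrors by priority, then <loc>."""
    mirrors = []
    for ln in url_elem.findall("rs:ln", NS):
        if ln.get("rel") == "duplicate":
            # Links without pri sort after all prioritized ones.
            mirrors.append((int(ln.get("pri", "999999")), ln.get("href")))
    ordered = [href for _, href in sorted(mirrors)]
    return ordered + [url_elem.findtext("sm:loc", namespaces=NS)]
```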
  • 89. Linking #2 - Alternate Representations
    2. Alternate representations of the same content
    This might occur due to:
    • Server supports HTTP content negotiation
    • Multiple copies of the same resource
    • Format migration for preservation reasons
    • Different clients wanting different formats
    • Multiple languages of the content
  • 90. Linking #2 - Alternate Representations
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
            xmlns:rs="http://www.openarchives.org/rs/terms/">
      <rs:md capability="changelist"
             from="2013-01-02T09:00:00Z"
             until="2013-01-03T09:00:00Z"/>
      <url>
        <loc>http://example.com/res1</loc>
        <lastmod>2013-01-02T13:00:00Z</lastmod>
        <rs:md change="updated"/>
        <rs:ln rel="alternate" type="text/html" href="http://example.com/res1.html"/>
        <rs:ln rel="alternate" type="application/pdf" href="http://example.com/res1.pdf"/>
      </url>
    </urlset>
  • 91. Linking #2 - Alternate Representations
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
            xmlns:rs="http://www.openarchives.org/rs/terms/">
      <rs:md capability="changelist"
             from="2013-01-02T09:00:00Z"
             until="2013-01-03T09:00:00Z"/>
      <url>
        <loc>http://example.com/res1.html</loc>
        <lastmod>2013-01-02T13:00:00Z</lastmod>
        <rs:md change="updated"/>
        <rs:ln rel="canonical" href="http://example.com/res1"/>
      </url>
    </urlset>
  • 92. Linking #3 - Patching Content
    3. Patching content rather than replacing
    This might occur when:
    • Resources are very large and the server wishes to conserve bandwidth where possible
    • Changes are frequent and small
    • Changes are managed in a CMS that tracks differences
    • A machine-processable format exists, or can be described, that allows the change to be replicated
  • 93. Linking #3 - Patching Content
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
            xmlns:rs="http://www.openarchives.org/rs/terms/">
      <rs:md capability="changelist"
             from="2013-01-02T09:00:00Z"
             until="2013-01-03T09:00:00Z"/>
      <url>
        <loc>http://example.com/res1.json</loc>
        <lastmod>2013-01-02T13:00:00Z</lastmod>
        <rs:md change="updated" length="398723"/>
        <rs:ln rel="http://www.openarchives.org/rs/terms/patch"
               type="application/json-patch"
               modified="2013-01-02T17:00:00Z"
               length="58"
               href="http://example.com/res1-patch.json"/>
      </url>
    </urlset>
  • 94. Linking #4 - Metadata about Resources
    4. Resources and their metadata
    This might occur due to:
    • Resources have additional metadata records, which are useful for understanding the resource
    • Such as cultural heritage images, audio, video
    • Collections with descriptive metadata
    • Resources with technical metadata
    • Administrative or Rights metadata
  • 95. Linking #4 - Metadata about Resources
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
            xmlns:rs="http://www.openarchives.org/rs/terms/">
      <rs:md capability="changelist"
             from="2013-01-02T09:00:00Z"
             until="2013-01-03T09:00:00Z"/>
      <url>
        <loc>http://example.com/res1</loc>
        <lastmod>2013-01-02T13:00:00Z</lastmod>
        <rs:md change="updated"/>
        <rs:ln rel="describedby"
               type="application/xml"
               href="http://example.com/metadata/res1.xml"/>
      </url>
    </urlset>
  • 96. Linking #4 - Metadata about Resources
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
            xmlns:rs="http://www.openarchives.org/rs/terms/">
      <rs:md capability="changelist"
             from="2013-01-02T09:00:00Z"
             until="2013-01-03T09:00:00Z"/>
      <url>
        <loc>http://example.com/metadata/res1.xml</loc>
        <lastmod>2013-01-02T13:00:00Z</lastmod>
        <rs:md change="updated"/>
        <rs:ln rel="describes"
               type="text/html"
               href="http://example.com/res1"/>
      </url>
    </urlset>
  • 97. Linking #5 - Prior Versions of Resources. But first…
  • 98. Memento Intermezzo http://www.mementoweb.org/
  • 99. URI for Original, URI for Version (Web Archive)
    URI-M - http://web.archive.org/web/20010911203610/http://www.cnn.com/
    URI-R - http://www.cnn.com/
  • 100. URI for Original, URI for Version (CMS)
    URI-M - http://en.wikipedia.org/w/index.php?title=September_11_attacks&oldid=282333
    URI-R - http://en.wikipedia.org/wiki/September_11_attacks
  • 107. Linking #5 - Prior Versions of Resources
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
            xmlns:rs="http://www.openarchives.org/rs/terms/">
      <rs:md capability="changelist"
             from="2013-01-02T09:00:00Z"
             until="2013-01-03T09:00:00Z"/>
      <url>
        <loc>http://example.com/res1</loc>
        <lastmod>2013-01-02T13:00:00Z</lastmod>
        <rs:md change="updated"/>
        <rs:ln rel="memento" href="http://example.com/past/20130102130000/res1"/>
        <rs:ln rel="timegate" href="http://example.com/timegate/res1"/>
        <rs:ln rel="timemap"
               href="http://example.com/timemap/res1"
               type="application/link-format"/>
      </url>
    </urlset>
  • 108. Linking #6 - Collection Membership
    6. Collection membership of resources
    A Source might want to express this because:
    • Resources are part of OAI-ORE aggregations
    • Resources are part of OAI-PMH sets
    • Or to indicate any other type of collection of resources
    Collections are named with URIs and can then be linked to with rel="collection"
    • Nice if the collection URI resolves to a useful description
  • 109. Linking #6 - Collection Membership
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
            xmlns:rs="http://www.openarchives.org/rs/terms/">
      <rs:md capability="changelist"
             from="2013-01-02T09:00:00Z"
             until="2013-01-03T09:00:00Z"/>
      <url>
        <loc>http://example.com/res1</loc>
        <lastmod>2013-01-02T13:00:00Z</lastmod>
        <rs:md change="updated"/>
        <rs:ln rel="collection" href="http://example.com/aggregation/allres"/>
      </url>
    </urlset>
  • 110. Linking #7 - Republishing Resources
    7. Republishing synchronized resources
    This might occur due to:
    • Aggregator systems that harvest resources from remote sites and then republish them at new URIs
    • Examples include blog republishing, content distribution networks, mirrored or combined collections
    • Hypothetical scenario: lots of little museums with small collections, and a large European/American aggregating digital library system that wants to provide fast, combined access to the content (with permission)
  • 111. Linking #7 - Republishing Resources
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
            xmlns:rs="http://www.openarchives.org/rs/terms/">
      <rs:md capability="changelist"
             from="2013-01-02T09:00:00Z"
             until="2013-01-03T09:00:00Z"/>
      <url>
        <loc>http://example.com/res1</loc>
        <lastmod>2013-01-02T13:00:00Z</lastmod>
        <rs:md change="updated"/>
        <rs:ln rel="via"
               modified="2013-01-02T10:00:00Z"
               href="http://original.example.org/res1"/>
      </url>
    </urlset>
  • 112. Linking #7 - Republishing Resources
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
            xmlns:rs="http://www.openarchives.org/rs/terms/">
      <rs:md capability="changelist"
             from="2013-01-02T09:00:00Z"
             until="2013-01-03T09:00:00Z"/>
      <url>
        <loc>http://aggregator.example.com/res1</loc>
        <lastmod>2013-01-02T18:00:00Z</lastmod>
        <rs:md change="updated"/>
        <rs:ln rel="via"
               modified="2013-01-02T13:00:00Z"
               href="http://example.org/res1"/>
        <rs:ln rel="via"
               modified="2013-01-02T10:00:00Z"
               href="http://original.example.org/res1"/>
      </url>
    </urlset>
  • 113. Link Relation Summary
    Relation                 | Use in ResourceSync                     | Defined in
    rel="alternate"          | Link from generic to specific URI       | HTML 5
    rel="canonical"          | Link from specific to generic URI       | RFC6596
    rel="collection"         | Resource is member of collection        | RFC6573
    rel="describedby"        | Has metadata                            | Protocol for Web Description Resources (POWDER): Description Resources
    rel="describes"          | Is metadata for                         | The 'describes' Link Relation Type
    rel="duplicate"          | Mirror or alternative copy              | RFC6249
    rel=".../rs/terms/patch" | A patch (efficient change information)  | This specification
    rel="memento"            | Link to time-specific URI               | Memento Internet Draft
    rel="timegate"           | Link to timegate                        | Memento Internet Draft
    rel="via"                | Provenance chain, came from             | RFC4287
  • 114. Related Resource Metadata Summary
    • Attributes of the <rs:ln> element; c.f. resource metadata + pri
    Element/Attribute | Description                                        | Defined by
    <rs:ln>           |                                                    | ResourceSync
      encoding        | HTTP Content-Encoding header value                 | RFC2616
      hash            | One or more content digests (md5, sha-1, sha-256)  | Atom Link Ext.
      href            | Related resource URI (identity)                    | RFC4287
      length          | HTTP Content-Length header value                   | RFC4287
      modified        | Timestamp of last change (c.f. <lastmod>)          | Atom Link Ext.
      path            | Path in ZIP package (Dump Manifests only)          | ResourceSync
      pri             | Priority of link                                   | RFC6249
      rel             | Relation - IANA registered or URI                  | RFC4287
      type            | HTTP Content-Type header value                     | RFC4287
  • 115. ResourceSync - Agenda
    4. Framework (Technical) Details
       4. Discovery
  • 116. Discovery of Capabilities
  • 117. Discovery of Capability Documents
    Requirements:
    • Need to discover capability documents, i.e. Resource List, Resource Dump, Change List, Change Dump, Archives
    • Need to know the type of capability each document represents
    Approach:
    • The Capability List provides links to these capability documents, if the Source supports them
    • These links have appropriate relation types, e.g. "resourcelist", "changelist", etc.
  • 118. Capability List
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
            xmlns:rs="http://www.openarchives.org/rs/terms/">
      <rs:md capability="capabilitylist"/>
      <rs:ln rel="resourcesync" href="http://example.com/.well-known/resourcesync"/>
      <url>
        <loc>http://aggregator.example.com/dataset1/resourcelist.xml</loc>
        <rs:md capability="resourcelist"/>
      </url>
      <url>
        <loc>http://aggregator.example.com/dataset1/changelist.xml</loc>
        <rs:md capability="changelist"/>
      </url>
    </urlset>
  • 119. Discovery of Capability Lists
    Requirements:
    • Need to discover a Capability List
    Approach:
    • HTTP Link header from resources subject to synchronization, relation type "resourcesync"
    • Links from HTML document <head>, relation type "resourcesync"
    • Links from Capability documents, relation type "up"
    Link header on example.com/res1.pdf:
      Link: <example.com/dataset1/capabilitylist.xml>;rel="resourcesync"
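On the Destination side, discovery via the HTTP Link header comes down to finding the link whose relation type is "resourcesync". The following is a deliberately simplified sketch using regular expressions; it does not handle every corner of the Link header grammar (for example, commas or angle brackets inside quoted parameter values), and `find_capability_lists` is a hypothetical helper name.

```python
import re


def find_capability_lists(link_header):
    """Return the URIs of all links whose rel includes 'resourcesync'."""
    uris = []
    # Each link is "<uri>" followed by its parameters, up to the next "<".
    for match in re.finditer(r'<([^>]*)>([^<]*)', link_header):
        uri, params = match.group(1), match.group(2)
        rel = re.search(r';\s*rel\s*=\s*"?([^";,]*)"?', params)
        # rel may hold several space-separated relation types.
        if rel and "resourcesync" in rel.group(1).split():
            uris.append(uri)
    return uris
```

Given the header value from the slide, `find_capability_lists('<example.com/dataset1/capabilitylist.xml>;rel="resourcesync"')` yields a one-element list holding that URI.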
  • 120. Discovery of Capabilities
  • 121. Discovery: ResourceSync Description
    Requirements:
    • Support for multiple Capability Lists, one per "set of resources"
    • Need to discover these Capability Lists
    • Need descriptive information about each set of resources that a Capability List pertains to
    • Useful to have descriptive information about the Source itself
    Approach:
    • The ResourceSync Description document meets these requirements
    • It should be at a particular location to avoid the need for registries: http://(hostname)/.well-known/resourcesync
    • It can be linked to from the Capability Lists as well
  • 122. Discovery of Capabilities
  • 123. ResourceSync Description
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
            xmlns:rs="http://www.openarchives.org/rs/terms/">
      <rs:md capability="resourcesync"/>
      <rs:ln rel="describedby" href="http://example.com/info_about_source.xml"/>
      <url>
        <loc>http://aggregator.example.com/dataset1/capabilitylist.xml</loc>
        <rs:md capability="capabilitylist"/>
        <rs:ln rel="describedby"
               href="http://example.com/dataset1/info_about_dataset1.xml"/>
      </url>
      <url>
        <loc>http://aggregator.example.com/dataset2/capabilitylist.xml</loc>
        <rs:md capability="capabilitylist"/>
        <rs:ln rel="describedby"
               href="http://example.com/dataset2/info_about_dataset2.xml"/>
      </url>
    </urlset>
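Starting from any resource URI, a Destination can derive the well-known location of the ResourceSync Description and then enumerate the Capability Lists it points to. A minimal sketch under those assumptions is shown below; the helper names and the shortened sample document are illustrative, and fetching over HTTP is left out.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit, urlunsplit

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
      "rs": "http://www.openarchives.org/rs/terms/"}

DESCRIPTION = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcesync"/>
  <url>
    <loc>http://example.com/dataset1/capabilitylist.xml</loc>
    <rs:md capability="capabilitylist"/>
    <rs:ln rel="describedby"
           href="http://example.com/dataset1/info_about_dataset1.xml"/>
  </url>
</urlset>"""


def well_known_uri(any_uri):
    """Build http://(hostname)/.well-known/resourcesync for a given URI."""
    parts = urlsplit(any_uri)
    return urlunsplit((parts.scheme, parts.netloc,
                       "/.well-known/resourcesync", "", ""))


def capability_lists(description_xml):
    """Return [(capability_list_uri, [describedby_uris])]."""
    root = ET.fromstring(description_xml)
    found = []
    for url in root.findall("sm:url", NS):
        md = url.find("rs:md", NS)
        if md is None or md.get("capability") != "capabilitylist":
            continue
        about = [ln.get("href") for ln in url.findall("rs:ln", NS)
                 if ln.get("rel") == "describedby"]
        found.append((url.findtext("sm:loc", namespaces=NS), about))
    return found
```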
  • 124. Discovery of Capabilities
  • 125. ResourceSync - Agenda
    4. Framework (Technical) Details
       5. Push method
  • 126. Motivation for a Push Component in ResourceSync
    • Reduce synchronization latency by having the Source push out resource change information
      o To avoid continuous pull of Change Lists by Destinations
    • Share information about changes to the Source’s ResourceSync implementation, e.g. announcement of new Resource List, new Capability List, etc.
      o To avoid continuous polling of e.g. Resource Lists, ResourceSync Description
  • 127. Notification Types
    • Events pertaining to a resource
      o updated | created | deleted for a resource
      o 3rd party defined events
    • Events pertaining to a set of resources
      o updated | created | deleted for a Resource List, Resource Dump, Change List, Change Dump, Archives
      o 3rd party defined events
    • Events pertaining to the overall ResourceSync implementation
      o updated | created | deleted for a Capability List, ResourceSync Description
      o 3rd party defined events
  • 128. Possible Push Technology: XMPP PubSub
    Other technologies: WebSockets, HTTP callback
  • 129. Notification Payload
    • Payload is the same irrespective of transport protocol
    • Use <urlset> as encapsulating element
    • One <url> element per notification
  • 130. Notification Payload - Resource Update (XMPP)
    <xmpp:iq from="sender@example.com" to="destination@example.org"
             type="set" id="liAJUz3S">
      <xmpp:pubsub>
        <xmpp:publish node="resource_notification_channel">
          <xmpp:item id="1234577">
            <sm:urlset xmlns:sm="http://www.sitemaps.org/schemas/sitemap/0.9"
                       xmlns:rs="http://www.openarchives.org/rs/terms/">
              <sm:url>
                <sm:loc>http://example.com/res1</sm:loc>
                <sm:lastmod>2013-01-02T14:00:00Z</sm:lastmod>
                <rs:md change="updated"
                       hash="md5:12324324jhhjl234234"
                       length="987665"
                       type="application/pdf"/>
              </sm:url>
            </sm:urlset>
          </xmpp:item>
        </xmpp:publish>
      </xmpp:pubsub>
    </xmpp:iq>
  • 131. Notification Payload - Capability Update (XMPP)
    <xmpp:iq from="sender@example.com" to="destination@example.com"
             type="set" id="liAJUz3S">
      <xmpp:pubsub>
        <xmpp:publish node="changelist_notification_channel">
          <xmpp:item id="1234577">
            <sm:urlset xmlns:sm="http://www.sitemaps.org/schemas/sitemap/0.9"
                       xmlns:rs="http://www.openarchives.org/rs/terms/">
              <sm:url>
                <sm:loc>http://example.com/dataset1/changelist.xml</sm:loc>
                <sm:lastmod>2013-01-02T14:00:00Z</sm:lastmod>
                <rs:md capability="changelist" change="updated"/>
              </sm:url>
            </sm:urlset>
          </xmpp:item>
        </xmpp:publish>
      </xmpp:pubsub>
    </xmpp:iq>
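Since the payload is the same regardless of transport, a Destination can handle the inner `<urlset>` independently of XMPP. The sketch below classifies each notification as a resource event or a capability event depending on whether its `<rs:md>` carries a `capability` attribute. That rule is inferred from the two example payloads, and `classify_notifications` is a hypothetical helper; extracting the payload from the XMPP stanza is not shown.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
      "rs": "http://www.openarchives.org/rs/terms/"}

PAYLOAD = """<sm:urlset xmlns:sm="http://www.sitemaps.org/schemas/sitemap/0.9"
           xmlns:rs="http://www.openarchives.org/rs/terms/">
  <sm:url>
    <sm:loc>http://example.com/dataset1/changelist.xml</sm:loc>
    <sm:lastmod>2013-01-02T14:00:00Z</sm:lastmod>
    <rs:md capability="changelist" change="updated"/>
  </sm:url>
</sm:urlset>"""


def classify_notifications(payload_xml):
    """Return one dict per <url>: kind, uri, change and capability."""
    root = ET.fromstring(payload_xml)
    events = []
    for url in root.findall("sm:url", NS):
        md = url.find("rs:md", NS)
        capability = md.get("capability") if md is not None else None
        events.append({
            # A capability attribute marks an event about a capability
            # document; otherwise the event concerns a plain resource.
            "kind": "capability" if capability else "resource",
            "uri": url.findtext("sm:loc", namespaces=NS),
            "change": md.get("change") if md is not None else None,
            "capability": capability,
        })
    return events
```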
  • 132. Push Technology Considerations
    • Notification channels
      o Multiple channels per Source to divide up notifications, e.g.
        - a channel for changes pertaining to all resources that belong to a set of resources
        - a channel for changes to capabilities for a set of resources
      o Server-side filtering preferred over client-side
    • Authentication/Authorization
      o To subscribe/create channels
  • 133. Push Technology Considerations
    • Delayed notification
      o Insurance that Destination does not miss anything
    • Discovery
      o Links to channels e.g. from a Capability List
      o Links from channels to other channels
      o Provide channel metadata (transport protocol info etc.)
  • 134. Push Channel Discovery
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
            xmlns:rs="http://www.openarchives.org/rs/terms/">
      …
      <url>
        <loc>xmpp:pubsub.example.com/dataset1?;node=resource_notification_channel</loc>
        <rs:md capability="resource-notification"/>
        <rs:ln rel="alternate"
               href="ws://example.com/dataset1/meta_notification_channel"/>
      </url>
      <url>
        <loc>xmpp:pubsub.example.com/dataset1?;node=capability_notification_channel</loc>
        <rs:md capability="capability-notification"/>
      </url>
      <url>
        <loc>xmpp:pubsub.example.com/dataset1?;node=resourcesync_notification_channel</loc>
  • 135. ResourceSync - Agenda
    4. Framework (Technical) Details
       6. Archives
  • 136. ResourceSync Framework Component: Archives
    In order to allow a Source to hold on to historical data and Destinations to catch up with events they have missed:
    o Publish a
      - Resource List Archive,
      - Resource Dump Archive,
      - Change List Archive, and/or a
      - Change Dump Archive
    o These are documents listing historical capability documents
  • 137. Resource List Archive
  • 138. Resource List Archive
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
            xmlns:rs="http://www.openarchives.org/rs/terms/">
      <rs:md capability="resourcelist-archive"
             from="2013-01-09T13:00:00Z"/>
      <url>
        <loc>http://example.com/resourcelist1.xml</loc>
        <lastmod>2013-01-02T13:00:00Z</lastmod>
      </url>
      <url>
        <loc>http://example.com/resourcelist2.xml</loc>
        <lastmod>2013-01-09T13:00:00Z</lastmod>
      </url>
      <url> … </url>
    </urlset>
  • 139. Resource Dump Archive
  • 140. Resource Dump Archive
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
            xmlns:rs="http://www.openarchives.org/rs/terms/">
      <rs:md capability="resourcedump-archive"
             from="2013-02-10T03:00:00Z"/>
      <url>
        <loc>http://example.com/resourcedump1.xml</loc>
        <lastmod>2013-01-10T03:00:00Z</lastmod>
      </url>
      <url>
        <loc>http://example.com/resourcedump2.xml</loc>
        <lastmod>2013-02-10T03:00:00Z</lastmod>
      </url>
      <url> … </url>
    </urlset>
  • 141. Change List Archive
  • 142. Change List Archive
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
            xmlns:rs="http://www.openarchives.org/rs/terms/">
      <rs:md capability="changelist-archive"
             from="2013-02-01T23:00:00Z"
             until="2013-02-03T23:00:00Z"/>
      <url>
        <loc>http://example.com/changelist1.xml</loc>
        <lastmod>2013-02-01T23:00:00Z</lastmod>
      </url>
      <url>
        <loc>http://example.com/changelist2.xml</loc>
        <lastmod>2013-02-02T23:00:00Z</lastmod>
      </url>
      <url>
        <loc>http://example.com/changelist3.xml</loc>
        <lastmod>2013-02-03T23:00:00Z</lastmod>
      </url>
    </urlset>
  • 143. Change Dump Archive
  • 144. Change Dump Archive
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
            xmlns:rs="http://www.openarchives.org/rs/terms/">
      <rs:md capability="changedump-archive"
             from="2013-02-10T03:00:00Z"
             until="2013-02-17T03:00:00Z"/>
      <url>
        <loc>http://example.com/changedump1.xml</loc>
        <lastmod>2013-02-10T03:00:00Z</lastmod>
      </url>
      <url>
        <loc>http://example.com/changedump2.xml</loc>
        <lastmod>2013-02-17T03:00:00Z</lastmod>
      </url>
      <url> … </url>
    </urlset>
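A Destination that has been offline can use an archive to work out which historical documents it still needs. The sketch below selects entries whose `<lastmod>` is later than the Destination's last synchronization time. Treating `<lastmod>` as the cut-off is an assumption of this sketch, as is comparing timestamps as strings, which only works when all of them share the same UTC format as in the slides; `documents_to_process` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
      "rs": "http://www.openarchives.org/rs/terms/"}

ARCHIVE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist-archive"
         from="2013-02-01T23:00:00Z" until="2013-02-03T23:00:00Z"/>
  <url><loc>http://example.com/changelist1.xml</loc>
       <lastmod>2013-02-01T23:00:00Z</lastmod></url>
  <url><loc>http://example.com/changelist2.xml</loc>
       <lastmod>2013-02-02T23:00:00Z</lastmod></url>
  <url><loc>http://example.com/changelist3.xml</loc>
       <lastmod>2013-02-03T23:00:00Z</lastmod></url>
</urlset>"""


def documents_to_process(archive_xml, last_sync):
    """Return URIs of archived documents newer than last_sync, oldest first."""
    root = ET.fromstring(archive_xml)
    pending = []
    for url in root.findall("sm:url", NS):
        lastmod = url.findtext("sm:lastmod", default="", namespaces=NS)
        # Same-format UTC timestamps sort chronologically as plain strings.
        if lastmod > last_sync:
            pending.append((lastmod, url.findtext("sm:loc", namespaces=NS)))
    return [loc for _, loc in sorted(pending)]
```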
  • 145. ResourceSync - Agenda
    5. Implementation
  • 146. Implementation #1: The Metadata Harvesting Use Case
  • 147. The Metadata Harvesting Use Case
    1. Identification of metadata records within a service
    2. Use of standards in metadata formats
    3. Incremental updates
    4. Create, Update, Delete
    5. Sets
  • 148. The Metadata Harvesting Use Case
    1. Identification of metadata records within a service
    2. Use of standards in metadata formats
    Notes:
    • ResourceSync does not specifically care about metadata records, only resources. It is up to the server to identify which of those resources are metadata.
    • We are free to annotate a resource's entry with appropriate metadata to indicate the format.
  • 149. The Metadata Harvesting Use Case
    3. Incremental updates
    4. Create, Update, Delete
    5. Sets
    Notes:
    • All resources that can be obtained from a change list will be annotated with the kind of change that happened to them.
    • ResourceSync allows the server to publish lists of resources and changes, and indexes of those lists, all annotated with metadata.
    • ResourceSync publishes changes as static documents. The client is then free to walk up and down the change lists provided by the server.
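Because every Change List entry is annotated with the kind of change, a harvesting client can replay the entries against its local copy. The sketch below models the local copy as a plain dict and takes a `fetch` callable that returns the content for a URI; both are stand-ins chosen for illustration, as is the `apply_changes` name.

```python
def apply_changes(store, changes, fetch):
    """Replay (uri, change) pairs, in order, against a dict of local copies."""
    for uri, change in changes:
        if change == "deleted":
            # Removing an entry that is already gone is not an error.
            store.pop(uri, None)
        elif change in ("created", "updated"):
            store[uri] = fetch(uri)
        else:
            raise ValueError("unknown change type: %r" % (change,))
    return store
```

A real harvester would typically replace the dict with files or repository records and `fetch` with an HTTP GET, but the replay logic stays the same.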
  • 150. (Required) Documents for metadata harvesting use case
  • 151. Describing Metadata Resources
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
            xmlns:rs="http://www.openarchives.org/rs/terms/">
      <rs:md capability="resourcelist" from="2013-05-05T13:00:00Z"/>
      <url>
        <loc>http://mydspace.edu/dspace-rs/resource/123456789/7/qdc</loc>
        <lastmod>2013-05-01T19:09:35Z</lastmod>
        <changefreq>never</changefreq>
        <rs:md type="application/xml"/>
        <rs:ln href="http://mydspace.edu/bitstream/123456789/7/1/bitstream.pdf"
               rel="describes"/>
        <rs:ln href="http://mydspace.edu/bitstream/123456789/7/2/image.jpg"
               rel="describes"/>
        <rs:ln href="http://mydspace.edu/123456789/3" rel="collection"/>
      </url>
    </urlset>
  • 152. Describing Bitstream Resources
    <urlset …
      <url>
        <loc>http://mydspace.edu/bitstream/123456789/7/1/bitstream.pdf</loc>
        <lastmod>2013-05-01T19:09:35Z</lastmod>
        <changefreq>never</changefreq>
        <rs:md hash="md5:75d0ea94097a05fce9aca5b079e2f209"
               length="419805"
               type="application/pdf"/>
        <rs:ln href="http://mydspace.edu/dspace-rs/resource/123456789/7/qdc"
               rel="describedby"/>
        <rs:ln href="http://mydspace.edu/dspace-rs/resource/123456789/7/mets"
               rel="describedby"/>
        <rs:ln href="http://mydspace.edu/dspace-rs/resource/123456789/12/qdc"
               rel="describedby"/>
        <rs:ln href="http://mydspace.edu/123456789/2" rel="collection"/>
      </url>
    </urlset>
  • 153. Serving Metadata Resources
    http://mydspace.edu/dspace-rs/resource/123456789/7/qdc
    URL parts: ResourceSync webapp (dspace-rs), Item handle (123456789/7), Metadata Format (qdc)
    Configuration:
      metadata.formats = qdc = http://purl.org/dc/terms/, mets = http://www.loc.gov/METS/
      metadata.types = qdc = application/xml, mets = application/xml
    Resulting entries:
      <loc>http://mydspace.edu/dspace-rs/resource/123456789/7/qdc</loc>
      <rs:md type="application/xml"/>
      <rs:ln href="http://purl.org/dc/terms/" rel="describedby"/>

      <loc>http://mydspace.edu/dspace-rs/resource/123456789/7/mets</loc>
      <rs:md type="application/xml"/>
      <rs:ln href="http://www.loc.gov/METS/" rel="describedby"/>
  • 154. Generating Documents
    1. Initialise: creates initial Capability List and Resource List documents
       [dspace]/bin/dspace dsrun org.dspace.resourcesync.ResourceSyncGenerator -i
    2. Update: creates a new Change List which covers the period since the last Change List was created
       [dspace]/bin/dspace dsrun org.dspace.resourcesync.ResourceSyncGenerator -u
    3. Rebase: a combination of both Initialise and Update
       [dspace]/bin/dspace dsrun org.dspace.resourcesync.ResourceSyncGenerator -r
  • 155. Usage of Resources by clients
  • 156. Impact on DSpace
  • 157. URLs
    • Stable identifiers for archived items
    • Stable identifiers for unarchived items
    • Stable identifiers for metadata resources (in their various formats)
    • Stable identifiers for previous versions
    Provenance
    • History of changes to an item/bitstream
    • Item/bitstream deletions (vs withdraw)
    • Bitstream create/update dates
    • Item create/update dates
  • 158. Versioning
    • Access to previous versions of both metadata and bitstreams
    • Stable identifiers for previous versions of both metadata and bitstreams
    Metadata Resources
    • Metadata in a variety of formats
    • Metadata as file/bitstream
  • 159. Admin Files
    • ResourceSync documents (Resource Lists, Change Lists, etc.)
    • ResourceSync exports: Resource Dumps, Change Dumps
    • Metadata exports in a number of formats
    Scheduled Tasks
    • Regular generation of RS documents
    Complex Objects
    • Item/bitstream relationships
    • Collections of content
  • 160. Get the software!
    DSpace module: https://github.com/CottageLabs/DSpaceResourceSync
      depends on the common Java library: https://github.com/CottageLabs/ResourceSyncJava
    PHP client: https://github.com/stuartlewis/resync-php
      depends on the SWORDv2 client library: https://github.com/swordapp/swordappv2-php-library/
  • 161. Implementation #2: ResourceSync at arXiv.org
  • 162. ResourceSync @ arXiv
      • Use ResourceSync for both mirroring and public data access
          o efficient updates
          o ability to do periodic audits
          o public synchronization capability
          o reduce admin burden
      • Likely start with metadata + source for the mirroring use case (doing experiments now)
      • Open access use cases also require the processed PDF
      • Some concerns about likely use/load…
  • 164. Alternate download location
      • Likely want to separate machine accesses from human accesses to preserve response time on main server => use the Mirrored Content part of the spec
          o <loc> specifies the canonical URI, e.g. http://arxiv.org/pdf/1306.1073v1.pdf
          o <rs:ln rel="duplicate"> specifies the preferred download location, e.g. http://export.arxiv.org/pdf/1306.1073v1.pdf
  • 165. Alternate download location
      <url>
        <loc>http://arxiv.org/pdf/1306.1073v1.pdf</loc>
        <lastmod>2013-06-06T00:57:12Z</lastmod>
        <rs:md hash="md5:e08e0c4e4d7b0895120014f0aa09e7c4" length="287714" type="application/pdf"/>
        <rs:ln rel="duplicate" pri="1" href="http://export.arxiv.org/pdf/1306.1073v1.pdf" modified="2013-06-06T02:00:59Z"/>
      </url>
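A destination might read an entry like the one on this slide and decide where to fetch from along the following lines. This is a minimal sketch, not the behaviour of any particular client: the `download_location` helper is hypothetical, the namespace declarations are added so the fragment parses on its own, and the rule that a lower `pri` value means a more preferred link is an assumption about how the attribute is meant to be used.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes used only inside this sketch.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# The slide's entry, with namespace declarations added so it parses standalone.
ENTRY = """<url xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
     xmlns:rs="http://www.openarchives.org/rs/terms/">
  <loc>http://arxiv.org/pdf/1306.1073v1.pdf</loc>
  <lastmod>2013-06-06T00:57:12Z</lastmod>
  <rs:md hash="md5:e08e0c4e4d7b0895120014f0aa09e7c4" length="287714" type="application/pdf"/>
  <rs:ln rel="duplicate" pri="1" href="http://export.arxiv.org/pdf/1306.1073v1.pdf" modified="2013-06-06T02:00:59Z"/>
</url>"""


def download_location(url_element):
    """Return (canonical_uri, uri_to_fetch) for one <url> entry.

    Falls back to the canonical URI when no rel="duplicate" link is present.
    """
    canonical = url_element.find("sm:loc", NS).text.strip()
    duplicates = [
        ln for ln in url_element.findall("rs:ln", NS)
        if ln.get("rel") == "duplicate"
    ]
    if not duplicates:
        return canonical, canonical
    # Assumption: the smallest pri value marks the most preferred mirror.
    best = min(duplicates, key=lambda ln: int(ln.get("pri", "999999")))
    return canonical, best.get("href")


if __name__ == "__main__":
    canonical, fetch_from = download_location(ET.fromstring(ENTRY))
    print("identify as:", canonical)
    print("download from:", fetch_from)
```

The resource is still identified by the canonical URI in `<loc>`; only the bytes are fetched from the mirror, which keeps machine traffic off the main server.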
  • 166. Getting a copy of arXiv
      It might be as easy as:
      (of course, you probably have to wait a while, but it is nice to know ResourceSync is stateless so one can efficiently restart)
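One way to picture why a restart is cheap: an interrupted destination can compare what is already on disk against the length and MD5 hash published for each resource and fetch only what is missing or different. The sketch below illustrates that idea under stated assumptions. The `is_up_to_date` and `resources_to_fetch` helpers and the `(relative_path, md5_hex, length)` tuple layout are hypothetical, and this is not a description of how the resync client itself is implemented.

```python
import hashlib
import os


def is_up_to_date(path, expected_md5, expected_length):
    """True if the local file already matches the published length and MD5."""
    if not os.path.isfile(path):
        return False
    # Cheap check first: a length mismatch avoids hashing the file at all.
    if os.path.getsize(path) != expected_length:
        return False
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5


def resources_to_fetch(entries, local_dir):
    """Return the relative paths that still need downloading.

    entries is an iterable of (relative_path, md5_hex, length) tuples, a
    layout assumed for this sketch.
    """
    return [
        rel for rel, md5_hex, length in entries
        if not is_up_to_date(os.path.join(local_dir, rel), md5_hex, length)
    ]
```

Because the decision depends only on the published list and the local files, no record of the earlier, interrupted run is needed.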
  • 167. Python Library and Client
      • Aim to provide library code implementing all ResourceSync facilities for use in both source and destination implementations
          o Designed for Python 2.6 (RHEL6) and 2.7
          o Will not work with Python <= 2.5
      • Client (resync) supports many destination operations, inspired by the common Unix rsync program
      • Client also supports some operations that might be useful in a source, such as generation of static Resource Lists or periodic Change Lists (used in arXiv experiments)
      • Explorer (resync-explorer) intended to allow easy inspection of a source’s resource sets and capabilities
      • Developed since ResourceSync v0.5, updated for v0.9
      http://github.com/resync/resync
  • 168. ResourceSync Source Simulator
      • Python code using Tornado server
      • Provides random set of resources of different sizes updated at a particular rate
      • Very useful for testing Destination code
      http://github.com/resync/simulator
  • 169. ResourceSync - Agenda: 6. Q&A
  • 173. Timeline
      • June 2013
          o Version 0.9 of ResourceSync framework specification released
          o Soliciting broad feedback
      • July 2013
          o Version 0.x of Push-based methods for ResourceSync
      • Fall 2013
          o Specification becomes NISO standard
  • 174. Pointers
      • Specification
          http://www.openarchives.org/rs/
          http://www.openarchives.org/rs/0.9/resourcesync
          http://www.openarchives.org/rs/0.9/archives
      • List for public comment
          https://groups.google.com/d/forum/resourcesync
      • Client and simulator code
          http://github.com/resync/resync
          http://github.com/resync/simulator
  • 175. ResourceSync: A Web-Based Resource Synchronization Framework. ResourceSync is funded by The Sloan Foundation & JISC. #resourcesync

Editor's Notes

  1. LANL Memento Aggregator of IIPC; Europeana does metadata via OAI-PMH but anticipates content also; arXiv – mirroring and data sharing; Linked data @ BBC; DBpedia; journal data at LANL. REST was not around in 1999.
  2. XML <-> OAI-PMH; large data begs a different question
  3. protected mostly about existing HTTP auth methods, stats -> just inventory
  4. Switching to a standardized resource-centric framework could
  5. Semantic web version of Wikipedia; want mirror to provide reliable basis for local services
  9. Top line – just metadata about resources, destination uses GET to get them (duh). Bottom line – packaged content => fewer round trips
  10. rsync etc. just reference; push vs pull -> both; many other parts
  12. What they have in common: versions exist at different URIs, because only the representation of a single state of a resource is available from a URI.
  14. Pattern exists in e.g. Wikipedia, W3C specs, Dryad. Not sure whether DOI in general follows this paradigm.
  15. Now the question is “How do we access those versions?” We can interlink them; there are RFCs that describe how to do that. But that URI-R is special: it is what typically gets bookmarked or put in email. We want to leverage the fact that this URI-R is always there and use it as the entry point.
  16. Memento addresses the problem in a resource-centric way: Resource, URI, state, representation, link, content negotiation
  17. Test site, has subsets of arXiv and even complete source plus metadata (at present not up to date with 0.9)
  18. No way around the difficulty of transferring 1TB initially but then a daily or weekly sync is efficient, and it still works even after some arbitrary time.
  19. Email and phone discussions over the past few months. Knock-down, drag-out two-day meeting after JCDL in DC in June.