This document summarizes a Linked Data workshop held at the Wellcome Institute on August 14, 2014. The workshop covered topics related to using Linked Data principles and RDF to describe archival resources and connect different datasets. Attendees learned about entities, identifiers, triples, vocabularies, and tools like OpenRefine for matching and reconciling names across datasets. The goal is to tell stories for both humans and computers by linking related people, places, events and other concepts across archives and other cultural heritage sources on the web.
Linked dataworkshopintro14aug2014
1. Linked Data: a practical approach
Wellcome Institute, 14 August 2014
Adrian Stevenson and Jane Stevenson
“Linked Data is Storytelling for computers. It doesn’t have the full richness,
complexity and nuance that we invest in our narratives, but it does at least
help computers to fit all the bits together in meaningful ways.”
2. Linked Data workshop
• Entities and Identities
• Documents and Data
• URIs and Connections
• Triples
• Data Creation
• RDF Graphs and the Archives Hub Graph
• Vocabularies
• Locah: our experience of creating RDF
• Connecting datasets
• Demonstration websites
• Name Matching and Demo of Open Refine
• Linking Lives interface
• Calm and Linked Data
• Round up and close
5. Martha Beatrice Webb, 1858-1943, social reformer
is the creator of some archive collections
6. Each of these is about
an archive collection
Each of these is a
document
7. Each document has
lots of useful
information
Each is formatted so
a human reader can
understand it
But let’s give each
document an
identifier that
works on the Web…
The Web works
with http://
12. Now we can make the statement:
<creator-of>
http://data.archiveshub.ac.uk/id/archivalresource/gb394-we
http://data.archiveshub.ac.uk/id/person/nra/webbmarthabeatrice1858-1943socialreformer
Martha Beatrice Webb
is the creator of the archive
Beatrice Webb: A summer
holiday in Scotland, 1884
…identifiers for the Web (for a machine) …labels for humans
13. <creator-of> is the creator of the archive
George Bernard Shaw Diaries
…identifiers for the Web (for a machine) …labels for humans
George Bernard Shaw, 1856-1950, playwright
http://archiveshub.ac.uk/id/archivalresource/gb0097sr0293
http://data.archiveshub.ac.uk/id/person/ncarules/shawgeorgebernard1856-1950irishdramatistcriticandnovelist
14. We can start to say things about relationships…
http://data.archiveshub.ac.uk/id/person/nra/webbmarthabeatrice1858-1943socialreformer
<knew>
http://data.archiveshub.ac.uk/id/person/ncarules/shawgeorgebernard1856-1950irishdramatistcriticandnovelist
15. We can start to say things that go beyond what is known within our
own space…
http://data.archiveshub.ac.uk/id/person/nra/webbmarthabeatrice1858-1943socialreformer
<is the same as>
http://viaf.org/viaf/86607236/
16. We can start to find different sources about the same person…
<is the same as>
http://viaf.org/viaf/121884166/
http://data.archiveshub.ac.uk/id/person/ncarules/shawgeorgebernard1856-1950irishdramatistcriticandnovelist
http://dbpedia.org/page/George_Bernard_Shaw
<is the same as>
17. We can put these ideas together…
http://data.archiveshub.ac.uk/id/person/nra/webbmarthabeatrice1858-1943socialreformer
<knew>
http://data.archiveshub.ac.uk/id/person/ncarules/shawgeorgebernard1856-1950irishdramatistcriticandnovelist
http://dbpedia.org/page/George_Bernard_Shaw
<also known as>
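The statements on the preceding slides can be sketched in code. This is a minimal Python illustration, assuming each triple is held as a plain (subject, predicate, object) tuple rather than in an RDF library. The person, VIAF and DBPedia URIs are the ones shown on the slides. The slides deliberately leave out property URIs, so using owl:sameAs for <is the same as> and foaf:knows for <knew> is an assumption made only for illustration, as are the helper function names.

```python
# Property URIs are assumptions: the slides use informal labels instead.
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"  # stands in for <is the same as>
FOAF_KNOWS = "http://xmlns.com/foaf/0.1/knows"        # stands in for <knew>

# Person identifiers as shown on the slides.
WEBB = ("http://data.archiveshub.ac.uk/id/person/nra/"
        "webbmarthabeatrice1858-1943socialreformer")
SHAW = ("http://data.archiveshub.ac.uk/id/person/ncarules/"
        "shawgeorgebernard1856-1950irishdramatistcriticandnovelist")

# Each statement is one (subject, predicate, object) triple.
triples = [
    (WEBB, FOAF_KNOWS, SHAW),
    (WEBB, OWL_SAME_AS, "http://viaf.org/viaf/86607236/"),
    (SHAW, OWL_SAME_AS, "http://viaf.org/viaf/121884166/"),
    (SHAW, OWL_SAME_AS, "http://dbpedia.org/page/George_Bernard_Shaw"),
]


def same_as(uri, graph):
    """Return the URIs directly asserted to be the same as `uri`, sorted."""
    return sorted(o for s, p, o in graph if s == uri and p == OWL_SAME_AS)


def sources_for_people_known_by(uri, graph):
    """Follow <knew> links from `uri`, then list same-as links for each person found."""
    known = [o for s, p, o in graph if s == uri and p == FOAF_KNOWS]
    return {person: same_as(person, graph) for person in known}


for person, sources in sources_for_people_known_by(WEBB, triples).items():
    print(person)
    for source in sources:
        print("  same as:", source)
```

Starting from Beatrice Webb's identifier alone, the sketch reaches external sources about George Bernard Shaw by following one relationship link and then the same-as links.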
22. Archival Resource
biographical
history
has
Beatrice Webb (1858-1943), nee Potter,
social reformer and diarist. Married to
Sidney Webb, pioneers of social science. She
was involved in many spheres of political
and social activity including the Labour Party,
Fabianism, social observation, investigations
into poverty, development of socialism, the
foundation of the National Health Service
and post war welfare state, the London
School of Economics, and the New
Statesman.
has
http://archiveshub.ac.uk/data/gb227msda865.w4
27. “You share vocabularies, so that other people (and computers) know when you’re
talking about the same sorts of things. You share identifiers, so that other people (and
computers) know that you’re talking about a specific person, place, object or
whatever.”
Tim Sherratt, Web Developer and Digital Historian, Australia
30. Archival Resource
biographical
history
has
Beatrice Webb (1858-1943), nee Potter,
social reformer and diarist. Married to
Sidney Webb, pioneers of social science. She
was involved in many spheres of political
and social activity including the Labour Party,
Fabianism, social observation, investigations
into poverty, development of socialism, the
foundation of the National Health Service
and post war welfare state, the London
School of Economics, and the New
Statesman.
archiveshub:hasBiographicalHistory
http://data.archiveshub.ac.uk/id/archivalresource/gb394-we
33. Linking Datasets
• If something is identified, it can be linked to
• We can then take items from one dataset and link
them to items from other datasets
BBC
VIAF
DBPedia
Archives Hub
Copac
GeoNames
34. “Humans, presented with pieces of information about
people, put things into the form of a story.” (Edward Ayers)
“even isolated and inert pieces of evidence – a list,
a letter, a map, a picture – can assume new and
unimagined meanings when placed in juxtaposition
with other fragments.” (Edward Ayers)
47. Matching Tools
• LOD Refine
• http://code.zemanta.com/sparkica/download.html
• SILK Framework
• http://wifo5-03.informatik.uni-mannheim.de/bizer/silk/#workbench
• Module 3 at http://euclid-project.eu/ good for use of
Open Refine and SILK
Workshop resources at http://data.archiveshub.ac.uk/workshops/wellcome2014/
48. LOD Refine
• Install files available from:
– Mac:
• http://data.archiveshub.ac.uk/workshops/wellcome2014/Mac.zip
– Windows:
• http://data.archiveshub.ac.uk/workshops/wellcome2014/Windows.zip
– Direct:
• http://code.zemanta.com/sparkica/download.html
• Install LOD Refine, run it and then in a web
browser go to http://localhost:3333/
49. LOD Refine
• Download example matching file from:
– http://data.archiveshub.ac.uk/workshops/wellcome2014/Matching_Sample.csv
– In LOD Refine go to ‘Create Project’ and import
the Matching_Sample.csv data.
50. Name Concatenation
• To concatenate the FamilyName, GivenName and Dates:
• Add new column:
– Click the down arrow to the left of ‘?Dates’ and select ‘Edit Column’ > ‘Add Column Based on this Column’
– Name the new column, e.g. ‘ConcatName’
– Use the following GREL expression:
• cells["?FamilyName"].value + ", " + cells["?GivenName"].value + ", " + cells["?Dates"].value
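For readers who want to see what the name concatenation step does outside LOD Refine, here is a rough Python equivalent of the GREL expression. It is only a sketch: the column names follow the slide, but the sample rows are made up for illustration, since the contents of Matching_Sample.csv are not reproduced here.

```python
import csv
import io

# Hypothetical sample rows; the real Matching_Sample.csv is not shown in the slides.
SAMPLE = """?FamilyName,?GivenName,?Dates
Webb,Martha Beatrice,1858-1943
Shaw,George Bernard,1856-1950
"""


def concat_name(row):
    """Mirror the GREL expression: family name, given name and dates joined by ', '."""
    return row["?FamilyName"] + ", " + row["?GivenName"] + ", " + row["?Dates"]


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
for row in rows:
    # Equivalent of 'Add Column Based on this Column' with the name 'ConcatName'.
    row["ConcatName"] = concat_name(row)
    print(row["ConcatName"])
```

Putting family name, given name and dates into one string gives the reconciliation service more context to match on than a bare name.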
52. Reconcile to VIAF
• Info on Roderick Page’s VIAF reconciliation service at:
– http://iphylo.blogspot.co.uk/2013/04/reconciling-author-names-using-open.html
• Add the VIAF reconciliation service by clicking on the concatenated name column’s down arrow and selecting ‘Reconcile’ > ‘Start reconciling’
• Add the URI for the VIAF reconciliation service:
– http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_viaf.php
• Start Reconciling!
54. VIAF Reconciliation
• Facet the reconciliation results by judgement
• Confirm the matched and unmatched data as required
• Possibly create another column for e.g. SKOS close matches or ‘isLikes’
55. Create VIAF URI Column
• Select the reconciled column’s dropdown menu > Edit column > Add column based on this column
• Give the column a name and add the GREL expression:
– "http://viaf.org/viaf/"+cell.recon.match.id
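The same URI construction can be sketched in Python. This is an illustration, not part of the LOD Refine workflow: the function name is hypothetical, and returning None for rows without a confirmed match is an assumption about how unmatched rows might be handled.

```python
VIAF_BASE = "http://viaf.org/viaf/"


def viaf_uri(match_id):
    """Build a VIAF URI from a reconciled match id, or return None if there is no match."""
    if not match_id:
        return None
    return VIAF_BASE + str(match_id)


# 86607236 is the VIAF number used for Beatrice Webb on an earlier slide.
print(viaf_uri("86607236"))
print(viaf_uri(None))
```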
56. Export the VIAF Triples
• Edit the RDF skeleton to include the columns to be matched and link using the owl:sameAs property.
• Check the preview
• Export the RDF as Turtle or RDF/XML as required.
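To give a feel for what the exported triples might look like, the following Python sketch writes owl:sameAs links as Turtle by hand. It is not the LOD Refine exporter and the output of the real RDF skeleton may differ. The function name is hypothetical, and the single example pair reuses the Beatrice Webb and VIAF identifiers from the earlier slides.

```python
OWL_PREFIX = "@prefix owl: <http://www.w3.org/2002/07/owl#> ."


def to_turtle(pairs):
    """Serialise (local URI, VIAF URI) pairs as owl:sameAs statements in Turtle."""
    lines = [OWL_PREFIX, ""]
    for local_uri, viaf in pairs:
        lines.append(f"<{local_uri}> owl:sameAs <{viaf}> .")
    return "\n".join(lines) + "\n"


pairs = [
    (
        "http://data.archiveshub.ac.uk/id/person/nra/"
        "webbmarthabeatrice1858-1943socialreformer",
        "http://viaf.org/viaf/86607236/",
    ),
]

turtle = to_turtle(pairs)
print(turtle)
```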
58. How we created the tabular data
59. http://wraggelabs.com/shed/presentations/anzsi/
What we need is a data framework that sits beneath the text,
identifying people, dates and places, and defining
relationships between them and our documentary sources. A
framework that computers could understand and interpret, so
that if they saw something they knew was a placename they
could head off and look for other people associated with that
place. Instead of just presenting our research we’d be creating
a whole series of points of connection, discovery and
aggregation. (Tim Sherratt)
…this is the goal of Linked Data.
A name: what does this represent? You may know who I am referring to – but how can you be sure? You usually need context.
But if the context is not given, or if it will vary….?
Need to be sure who this is – need more than a name.
This is more like it – unique surely? Identifiable in any context.
Dates and a description.
It may seem obvious, but even when creating a standard archival description the importance of a unique name entry is not always thought about.
OK, so we want to talk about this person in the context of archives. As a creator of archives.
This is a nice clear human-readable statement.
On the Archives Hub are a number of descriptions of collections where Beatrice Webb is the creator.
Descriptions are documents – the Web has largely been about the idea of documents – accessing documents and linking through to other documents via hypertext links.
This has led us to think in terms of an archive description being a document – being presented as a whole, just as it would be pre-internet when it was printed out and available in the search room.
The documents are designed to be easily readable by humans.
But they are on the Web. We need to identify them in this context.
The Web identifies things using HTTP URIs.
So here are all the identifiers for each of these collections, making them identifiable on the Web. They can be bookmarked, linked to, cited, and they are searchable via Google.
So, the identifier is an integral part of the description.
This identifier follows good practice – it uses a clear, short URI that will be unique (should be unique!) because it includes the archive collection reference, the repository code and the country code.
Back to our name, which is included a number of times within the document.
Thinking along the same lines as for the description itself, we need to identify the person in a web context – so she needs a URI as well.
Now it starts to get interesting – providing URIs enables statements to be made in a machine-readable way.
Here we have not included the URIs for the properties (that’s another story) but the principle is clear: there is a human readable version of the statement and a machine-readable version.
Statements can be made based on this principle across all the data.
So we get to the heart of the matter – it is all about connections, relationships and context. In isolation the identifier for a person is not useful, but we can start to refer to relationships with other people.
When you talk about linked data, the real power lies in going outside of your own world – if other things have URIs then you can link to them.
Researchers want to bring information together, so that’s what we should strive to do. We should strive to break down barriers between data sources. Documents tend to maintain barriers, or at least provide only narrow gates between them. A hyperlink from one document to another to another….
These connections can be built up, we can start to link data sources in so many ways.
A basic 3-part statement – a triple
A literal, which uses text rather than a URI.
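The difference between a URI and a literal shows up clearly when a triple is written out. The sketch below, in Python, formats one statement in N-Triples style: URIs go in angle brackets and literals in quotes. It is an illustration under assumptions: the namespace behind the archiveshub: prefix is not given in the slides, so the predicate URI here is a hypothetical placeholder, the helper names are made up, and the biographical text is shortened.

```python
def nt_term(value, is_literal=False):
    """Format a term N-Triples style: <uri> for a URI, "text" for a literal."""
    if is_literal:
        escaped = (value.replace("\\", "\\\\")
                        .replace('"', '\\"')
                        .replace("\n", "\\n"))
        return f'"{escaped}"'
    return f"<{value}>"


def nt_line(subject, predicate, obj, literal=False):
    """Build one 3-part statement: subject, predicate, object."""
    return f"{nt_term(subject)} {nt_term(predicate)} {nt_term(obj, literal)} ."


# Hypothetical placeholder for archiveshub:hasBiographicalHistory.
HAS_BIOG_HIST = "http://example.org/archiveshub/hasBiographicalHistory"

line = nt_line(
    "http://data.archiveshub.ac.uk/id/archivalresource/gb394-we",
    HAS_BIOG_HIST,
    "Beatrice Webb (1858-1943), nee Potter, social reformer and diarist.",
    literal=True,
)
print(line)
```

A machine can follow the subject and predicate because they are identifiers. The literal is text for a human to read, and nothing can be linked from it.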
We have four ‘things’ here: unit of description; repository; finding aid; EAD document.
We have given Unit of description a number of properties. Other things can also have properties (this is simplified)
These properties are indicated in the green boxes. They are also called predicates.
The Archives Hub RDF model (simplified)
In hypertext web sites it is considered generally rather bad etiquette not to link to related external material. The value of your own information is very much a function of what it links to, as well as the inherent value of the information within the web page. So it is also in the Semantic Web.
Remember, this is about machines linking – machines need identifiers; humans generally know when something is a place or when it is a person.
BBC + DBPedia + GeoNames + Archives Hub + Copac + VIAF = the Web as an exploratory space
We’ve looked at various sources. They don’t all provide the data that you might imagine from looking at the end user interface. It is a learning process to figure out what they do provide and how best to link to them.
VIAF is a key hub for us – our main link is to VIAF through those ‘same as’ links.
Wikipedia (DBPedia) is probably the most popular linked data hub and we’ve drawn data from here – the image in particular.
The RDF view is the view that shows the properties and values that the data provides.
New datasets are being added to the linked data space all the time, and this means the opportunities grow.
National Museum of Australia in Canberra. You can build your own wall with search parameters. Tim Sherratt.