5. hetarchief.be
“News from the Great War”
• Newspapers 1914 - 1918
• 10+ Content Partners
• Early 2015: site launched
• Functionality
• Search by keyword
• Map with place of publication
• Collections
• Size: 1k titles, 55k newspapers, 300k pages
7. Policy
1. Metadata
• No restrictions → CC0
2. OCR, documents
• Pictures, short stories…
• Uncertain copyright status
• No license or “terms of use” that minimises restrictions for re-use
• Disclaimer
8. hetarchief.be
• One of the biggest databases online
• No raw data?
• Title
• Description → OCR from ALTO
• Date created
• Owner
• IDs (carrier, Abraham, VIAA)
• URL image
10. First 3 Stars
• Open License
• Structured
• Non-proprietary
[Diagram: VIAA DB, VIAA API, NodeJS (→ github.com/viaacode/hetarchief2lod), IDs and metadata, CSV, transform]
11. Step 4: URIs for everything
• Map VIAAs internal ID to URI:
• http://data.viaa.be/noid/{id}
• Use ontologies
• BBC → Creative Work Ontology
• schema.org
• Hydra → collections
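The ID-to-URI mapping and the use of an ontology term can be sketched in a few lines. This is a minimal illustration, not the hetarchief2lod code: the CSV column names `id` and `title`, the helper names, and the sample title are assumptions; only the URI pattern and the BBC Creative Work `title` property come from the slides.

```python
import csv
import io
from urllib.parse import quote

BASE = "http://data.viaa.be/noid/"  # URI pattern from the slide
CW_TITLE = "http://www.bbc.co.uk/ontologies/creativework#title"


def mint_uri(viaa_id):
    """Map an internal VIAA ID to a URI, percent-encoding unsafe characters."""
    return BASE + quote(viaa_id, safe="")


def literal(value):
    """Serialise a string as an N-Triples literal (basic escaping only)."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def csv_to_ntriples(csv_text):
    """Turn CSV rows with hypothetical columns 'id' and 'title' into title triples."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        f"<{mint_uri(row['id'])}> <{CW_TITLE}> {literal(row['title'])} ."
        for row in reader
    ]


# Placeholder title; the ID is the one shown on the slides.
sample = "id,title\ntm71v5c76q_191510XX,Example title\n"
for triple in csv_to_ntriples(sample):
    print(triple)
```

Emitting N-Triples keeps the transform step line-oriented, so the output can be concatenated or loaded into any RDF store without further parsing.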
14. 5-star: link to other sources
• ABRAHAM: catalogue of newspapers in Belgium
<http://anet.be/record/abraham/opacbnc/c:bnc:26> owl:sameAs <http://data.viaa.be/noid/tm71v5c76q_191510XX>
19. Connect with other datasources
• Cfr. Europeana, delpher.nl, lab.kbresearch.nl
20. Stanford NER
• 4 types: Location, Organisation, Person and Other
• Train your model: golden corpus
• Write code that fits your needs
• SPARQL query that matches strings
• REPERTOIRE des COMMUNES et des PRINCIPAUX HAMEAUX de la ci-devant Belgique
• Difficult to find cultural APIs (cfr. InFlandersField list of names, Abraham catalogue)
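The "write code that fits your needs" option, matching strings against a list of known names such as a gazetteer of communes, could look roughly like this. It is a sketch of plain dictionary matching, not the approach actually used; the function name and the sample place names are illustrative.

```python
import re


def find_places(text, gazetteer):
    """Return (surface form, character offset) pairs for gazetteer names in text."""
    # Longest names first, so a long name wins over a shorter overlapping one.
    names = sorted(gazetteer, key=len, reverse=True)
    if not names:
        return []
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,
    )
    return [(match.group(0), match.start()) for match in pattern.finditer(text)]


# Illustrative input; a real gazetteer would be loaded from a file or endpoint.
places = {"Ieper", "Diksmuide", "Gent"}
print(find_places("Zware gevechten bij Ieper en Diksmuide.", places))
```

Exact matching like this is sensitive to OCR errors and historical spellings, which is one reason a trained NER model remains useful next to it.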
21. DBpedia Spotlight
• Proof of concept
• Models for all languages (nl, en, fr, de)
[Diagram: NL/FR/EN/DE trained model, DBpedia matcher, Stanford NER]
22. Results?
• Filter on OCR quality, e.g. word confidence below 90% in ALTO
• Wrong time period, e.g. GeoNames
• Standard models used; they still need training
• Use DBpedia knowledge later to filter “impossible” tags
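Filtering on OCR quality can be sketched with the word confidence that ALTO stores per word: the `WC` attribute of each `String` element, a value between 0 and 1. The sample XML, the function name, and the choice to drop words that carry no `WC` at all are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def confident_words(alto_xml, threshold=0.9):
    """Return the CONTENT of ALTO String elements whose WC is at least threshold."""
    root = ET.fromstring(alto_xml)
    words = []
    for element in root.iter():
        # Compare the local name only, so any ALTO schema version is accepted.
        if element.tag.rsplit("}", 1)[-1] != "String":
            continue
        confidence = element.get("WC")
        # Assumption: words without a confidence value are treated as unreliable.
        if confidence is not None and float(confidence) >= threshold:
            words.append(element.get("CONTENT", ""))
    return words


# Made-up ALTO fragment with one badly recognised word.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Ieper" WC="0.97"/>
    <String CONTENT="lcper" WC="0.41"/>
    <String CONTENT="front" WC="0.93"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""

print(confident_words(SAMPLE))
```

Running NER only on the surviving words trades recall for fewer tags caused by OCR noise.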
23. DBpedia Spotlight
• Running your own endpoint is easy:
• java -Xmx8G -jar dbpedia-spotlight-0.7.1.jar nl http://localhost:2223/nl/rest
• Or with Docker:
• docker build -f Dockerfile -t dutch_spotlight .
• docker run -i -p 2223:80 dutch_spotlight spotlight.sh
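Once an endpoint is running, annotating a piece of OCR text is one HTTP call. The sketch below assumes the usual Spotlight REST interface: an `annotate` resource under the base URL, `text` and `confidence` parameters, and a JSON response with a `Resources` list. The base URL is passed in and should be whatever address the server was started with; the helper names are illustrative.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_annotate_request(endpoint, text, confidence=0.5):
    """Build a GET request for the annotate resource, asking for JSON."""
    query = urlencode({"text": text, "confidence": confidence})
    url = f"{endpoint.rstrip('/')}/annotate?{query}"
    return Request(url, headers={"Accept": "application/json"})


def extract_entities(response):
    """Pull (surface form, DBpedia URI) pairs out of a decoded JSON response."""
    # The "Resources" key is absent when nothing was recognised.
    return [
        (resource["@surfaceForm"], resource["@URI"])
        for resource in response.get("Resources", [])
    ]


def annotate(endpoint, text, confidence=0.5):
    """Send the text to a running endpoint and return the recognised entities."""
    with urlopen(build_annotate_request(endpoint, text, confidence)) as reply:
        return extract_entities(json.load(reply))
```

With a local server up, a call such as `annotate("http://localhost:2223/nl/rest", "Slag bij Ieper")` would return the linked entities; raising `confidence` yields fewer but safer tags.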
24. Linked Data as a Service
• Allow federated queries
• Low server cost
• Be reliable
• Triple Pattern Fragments: a Low-cost Knowledge Graph Interface for the Web
25. Linked Data Fragments querying
• VIAA is part of the family!
• http://data.viaa.be/ldf
• https://query.wikidata.org/bigdata/ldf
• http://data.linkeddatafragments.org/linkedgeodata
• http://data.linkeddatafragments.org/dbpedia2014
• Your browser: client-side algorithm, GET fragments
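What the client-side algorithm does with "GET fragments" can be illustrated by building the URL of a single fragment. The sketch assumes the `subject`, `predicate` and `object` query parameters that Linked Data Fragments servers commonly expose; a proper client discovers them from the hypermedia controls in each response rather than hard-coding them.

```python
from urllib.parse import urlencode


def fragment_url(interface, subject=None, predicate=None, obj=None):
    """Build the URL of the fragment matching one triple pattern.

    Positions left as None act as variables in the pattern.
    """
    pattern = (("subject", subject), ("predicate", predicate), ("object", obj))
    params = {name: value for name, value in pattern if value is not None}
    return interface + ("?" + urlencode(params) if params else "")


# All triples with the Creative Work title predicate, i.e. { ?paper cw:title ?title }.
print(
    fragment_url(
        "http://data.viaa.be/ldf",
        predicate="http://www.bbc.co.uk/ontologies/creativework#title",
    )
)
```

A SPARQL query is answered by requesting one such fragment per triple pattern, possibly from different interfaces, and joining the results in the browser; this keeps the server cheap and makes federation a client concern.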
27. Demo
• Retrieve all newspaper titles:
SELECT DISTINCT ?title
WHERE {
?paper <http://www.bbc.co.uk/ontologies/creativework#title> ?title
}
28. Demo
• Retrieve more info from the corresponding DBpedia URI:
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label ?comment
WHERE {
<http://data.viaa.be/noid/2z12n51476_19141120_0001> <http://www.bbc.co.uk/ontologies/creativework#tag> ?tag .
?db owl:sameAs ?tag .
?db rdfs:label ?label .
?db rdfs:comment ?comment
}
29. Battle of the Somme
• Pages that mention military leaders from the Battle of the Somme, plus thumbnail:
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT ?paper ?o ?thumbnail
WHERE {
<http://dbpedia.org/resource/Battle_of_the_Somme> <http://dbpedia.org/ontology/commander> ?o .
?paper <http://www.bbc.co.uk/ontologies/creativework#tag> ?ctag .
?o owl:sameAs ?ctag .
?o <http://dbpedia.org/ontology/thumbnail> ?thumbnail .
}
30. Frontpainters
• Semi-automatic generation of collections, e.g. about frontpainters
# dc: is bound to Dublin Core terms, the vocabulary DBpedia uses for categories
PREFIX dc: <http://purl.org/dc/terms/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT ?newspaper ?artist ?tag ?hetarchief
WHERE {
?artist dc:subject <http://dbpedia.org/resource/Category:Belgian_war_artists> .
?artist owl:sameAs ?tag .
?newspaper <http://www.bbc.co.uk/ontologies/creativework#tag> ?tag .
?newspaper <http://www.w3.org/2000/01/rdf-schema#seeAlso> ?hetarchief
}
31. Conclusion
• Extra search method for our researchers
• NER versus OCR: enhanced findability
• Adding extra information (cfr. Abraham) requires effort; we need more TPF interfaces
32. Future work
• Dereferenceable URIs
• http://data.viaa.be/noid/{id}
• Content negotiation
• HTML
• JSON
• RDF
• Save location with OLR
• Suggestions are welcome!
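The content negotiation item under Future work could work along these lines: the server inspects the `Accept` header and picks HTML, JSON or RDF for the same URI. This is a simplified sketch, not a planned implementation. The media types are assumptions (Turtle stands in for "RDF"), and the specificity rules of full HTTP negotiation are left out.

```python
SUPPORTED = ["text/html", "application/json", "text/turtle"]  # assumed formats


def parse_accept(header):
    """Split an Accept header into (media type, q-value) pairs."""
    ranges = []
    for part in header.split(","):
        fields = part.strip().split(";")
        media = fields[0].strip().lower()
        if not media:
            continue
        quality = 1.0
        for field in fields[1:]:
            name, _, value = field.strip().partition("=")
            if name.strip() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        ranges.append((media, quality))
    return ranges


def matches(media_range, offered):
    """True if a media range such as text/* or */* covers the offered type."""
    if media_range in ("*/*", offered):
        return True
    return media_range.endswith("/*") and offered.startswith(media_range[:-1])


def negotiate(header, supported=SUPPORTED):
    """Return the supported type with the highest q-value, or None."""
    best, best_quality = None, 0.0
    for offered in supported:
        for media_range, quality in parse_accept(header):
            # Ties keep the earlier entry of `supported`, so HTML is the default.
            if quality > best_quality and matches(media_range, offered):
                best, best_quality = offered, quality
    return best


print(negotiate("text/turtle;q=0.9, text/html;q=0.5"))
```

A browser sending `text/html` would get the page, while a Linked Data client asking for `text/turtle` would get the triples from the same http://data.viaa.be/noid/{id} address.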