The identification and cataloguing of documentary evidence is an important part of empirical research in the humanities.
An increasing number of recent initiatives in the digital humanities have as a primary objective the curation of collections of digital artefacts augmented with fine-grained metadata, for example, mentioning the entities and their relations, often adopting the "Linked Data" paradigm. This talk is focused on exploring the potential of Linked Data to support humanities scholars in identifying, collecting, and curating documentary evidence. First, I will introduce the basic notions around Linked Data and place its emergence in the tradition of Knowledge Representation, an area of Artificial Intelligence (AI). Second, I will show how Linked Data and AI techniques have been successfully applied in the Listening Experience Database project to support the retrieval and curation of documentary evidence. Finally, I will conclude the presentation by discussing the potential (and challenges) of adopting a "knowledge extraction" paradigm to automate the identification and cataloguing of metadata about documentary evidence in texts.
Linked data for knowledge curation in humanities research
1. Linked data for knowledge curation in
humanities research
Enrico Daga
Research Fellow, Knowledge Media Institute, The Open University
14th January 2020, Lancaster University / History Dept.
enrico.daga@open.ac.uk - @enridaga
4. Invented the web in 1989
(yeah!)
Invented the semantic web
in 1994 (duh?)
5. “To a computer, then, the web is a flat, boring
world devoid of meaning”
Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/
6. “This is a pity, as in fact documents on the web
describe real objects and imaginary concepts,
and give particular relationships between them”
Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/
7. “Adding semantics to the web involves two things:
allowing documents which have information in
machine-readable forms, and allowing links to be
created with relationship values.”
Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/
8. “The Semantic Web is not a separate Web but an
extension of the current one, in which information is
given well-defined meaning, better enabling
computers and people to work in cooperation.”
Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/
11. This did not come out of the blue
Academic communities worldwide have been dealing with knowledge
representation for years
Artificial intelligence, natural language processing, model management,
and many other research fields largely contributed
Some ancestors traced the way …
12. EXAMPLE
• Instances are associated with one or several classes:
Boddingtons rdf:type Ale .
Grafentrunk rdf:type Bock .
Hoegaarden rdf:type White .
Jever rdf:type Pilsner .
Ale rdfs:subClassOf TopFermentedBeer .
White rdfs:subClassOf TopFermentedBeer .
TopFermentedBeer rdfs:subClassOf Beer .
Bock rdfs:subClassOf BottomFermentedBeer .
rdfs:subClassOf rdf:type owl:TransitiveProperty .
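As a rough illustration of what the transitivity statement makes possible, the sketch below replays the triples above in plain Python and derives every class an instance belongs to. The tuple representation and the function names `superclasses` and `infer_types` are assumptions made for this example; a real system would delegate this to an RDFS/OWL reasoner.

```python
# Triples copied from the slide, as (subject, predicate, object) tuples.
TRIPLES = [
    ("Boddingtons", "rdf:type", "Ale"),
    ("Grafentrunk", "rdf:type", "Bock"),
    ("Hoegaarden", "rdf:type", "White"),
    ("Jever", "rdf:type", "Pilsner"),
    ("Ale", "rdfs:subClassOf", "TopFermentedBeer"),
    ("White", "rdfs:subClassOf", "TopFermentedBeer"),
    ("TopFermentedBeer", "rdfs:subClassOf", "Beer"),
    ("Bock", "rdfs:subClassOf", "BottomFermentedBeer"),
]


def superclasses(cls, triples):
    """Return every class reachable from cls by following rdfs:subClassOf."""
    found = set()
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if s == current and p == "rdfs:subClassOf" and o not in found:
                found.add(o)
                frontier.append(o)
    return found


def infer_types(instance, triples):
    """Asserted classes of an instance plus all of their superclasses."""
    asserted = {o for s, p, o in triples if s == instance and p == "rdf:type"}
    inferred = set(asserted)
    for cls in asserted:
        inferred |= superclasses(cls, triples)
    return inferred


print(sorted(infer_types("Boddingtons", TRIPLES)))
print(sorted(infer_types("Grafentrunk", TRIPLES)))
```

Boddingtons is only stated to be an Ale, yet it is inferred to be a Beer as well. Grafentrunk is not, because the triples above never state that BottomFermentedBeer is a subclass of Beer.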
15. Ontologies, different types of
Domain independent: SKOS, OWL, Prov, Time, …
Foundational, general purpose:
• DOLCE, SUMO (“Upper Ontologies”)
• CIDOC-CRM: broad scope, targets “cultural heritage” in general
Pragmatic, community-oriented:
• Dublin Core Metadata Initiative
• Google’s schema.org
• https://linked.art/
• Humanities forums: LinkedPasts series, WHiSe Workshops
https://lov.linkeddata.es/dataset/lov
16. Linked Data in a nutshell
https://en.wikipedia.org/wiki/Linked_data
Linked Data is a way of publishing structured information that allows data
to be connected and enriched by means of links among their entities.
• LD uses the World Wide Web as publishing platform
• LD is based on basic Web standards (URIs, HTTP, RDF)
• open to everyone
• LD enables the adoption of shared schemas (Ontologies)
• LD makes the data self-explanatory and self-documented
• LD enables your data to refer to other data
• … and other data to refer to yours!
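A minimal sketch of the last two points, in plain Python rather than an RDF library: two independently published datasets become connected simply because one uses a URI that the other describes. All `example.org` URIs, the property names, and the helper functions are hypothetical and serve only to illustrate the idea.

```python
# Dataset A: a hypothetical local dataset that points to an external URI.
LOCAL = {
    ("http://example.org/led/experience/1",
     "http://example.org/vocab/place",
     "http://dbpedia.org/resource/Amsterdam"),
}

# Dataset B: a hypothetical excerpt of another publisher's data.
EXTERNAL = {
    ("http://dbpedia.org/resource/Amsterdam",
     "http://example.org/vocab/country",
     "http://dbpedia.org/resource/Netherlands"),
}


def objects(subject, predicate, *datasets):
    """All objects of (subject, predicate, ?) across the given datasets."""
    return {o for data in datasets for (s, p, o) in data
            if s == subject and p == predicate}


def countries_of(experience, *datasets):
    """Follow the link experience -> place -> country across datasets."""
    result = set()
    for place in objects(experience, "http://example.org/vocab/place", *datasets):
        result |= objects(place, "http://example.org/vocab/country", *datasets)
    return result


print(countries_of("http://example.org/led/experience/1", LOCAL, EXTERNAL))
```

With the local dataset alone, the country cannot be answered; it becomes available once the external data about the shared URI is taken into account.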
17. Linked Open Data Cloud in 2007
“Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/”
18. Linked Open Data Cloud in 2010
2010 - The OU launches the data.open.ac.uk
Linked Open Data portal, the first of its kind in the UK
19. The OU Open Knowledge Graph
http://data.open.ac.uk
20. Linked Open Data Cloud in 2014
Crawlable
http://linkeddatacatalog.dws.informatik.uni-mannheim.de/state/
24. How can this wealth of data support the
retrieval of documentary evidence?
The identification and cataloguing of documentary evidence from textual
corpora is an important part of empirical research based on
historiographical methodology.
25. The Listening Experience Database
• An open and freely searchable database
that brings together a mass of data
about people’s experiences of listening
to music of all kinds, in any historical
period and any culture.
• Sophisticated data model, natively in RDF
• Linked Open Data:
http://data.open.ac.uk/context/led
• Since 2012, the LED project has collected
over 10,000 unique listening experiences
from a variety of textual sources
https://led.kmi.open.ac.uk/
26. Problem: humanists coin new concepts!
• Traditional AI research is focused on common sense notions
• keyword & topic based information retrieval (documents related to
“Science” or “Music”)
• events as declared statements (e.g. U.S. base attacked by Iranian missiles)
• Problem: humanities databases are built on novel concepts, e.g.
• Listening experience (LED Project)
• Reading experience (EU funded READ-IT project)
• Sitting Experience (DH/Arts History PhD at the OU)
27. Manual workflow
Problems: the activity (a) requires effort / time, (b) is not systematic, (c) is
prone to errors, and (d) the methodology is (often) not documented
How can we help scholars find a piece of evidence in a text?
28. How to detect concepts beyond keywords?
We coin the expression themed evidence, to refer to (direct or indirect) traces of a
fact or situation relevant to a theme of interest and study the problem of
identifying them in texts.
The task of identifying themed evidence is at the intersection between topical text
classification (finding texts relevant to a certain theme) and event retrieval (finding
events mentioned in texts).
Not all topical texts are themed evidence, and the nature of the event itself is often
assumed, implicit, and left to the reader.
Daga, Enrico, and Enrico Motta. "Capturing themed evidence, a hybrid approach." In Proceedings of the 10th
International Conference on Knowledge Capture, pp. 93-100. 2019.
29. Finding Listening Experiences (theme: music)
• RECMUS-619, positive: Introduced to the Anacreontic Society, consisting of
amateurs who perform admirably the best orchestral works. The usual supper
followed. After propitiating me with a trio from ’Cosi Fan Tutte’, they drew me to
the piano.
• MASONB-31, positive: In the evening we went to Rev. Baptist Noel’s chapel,
where one is always sure of edification from the sermon if not from the psalms.
• MASONB-88, negative: Flags and pendants were suspended from the windows,
[. . . ] the colors of the German States were waving harmoniously together, and
the banners of the Fine Arts, with appropriate inscriptions, particularly those of
music, poetry and painting, were especially honored, and floated triumphant
amidst the standards of electorates, dukedoms, and kingdoms.
30. A Hybrid Approach
• Themed evidence is a subset of topical texts (e.g. about “music”) - distributional semantics
• Common knowledge graphs include large amounts of interlinked entities, including topical
entities (in the category “music”) - entity linking to structured knowledge
• Background knowledge can be used for learning features and tuning elements of the method -
corpus based analysis
• LE Database includes text excerpts that can be analysed as positive examples.
• Project Gutenberg: more than 58k books in the public domain (48,790 in English)
• DBpedia is a large knowledge graph published as Linked Data. Includes SPARQL endpoint and a
NER tool: DBpedia Spotlight
• We formalise the task as a binary classification problem; approach in three steps:
1. Statistical relatedness analysis -> from a key term (e.g. “Music”)
2. Themed-entity detection -> About a key subject (e.g. dbpedia:Music)
3. Hybridisation phase
32. Statistical relatedness // Example
RECMUS-619, positive: Introduced to the Anacreontic Society,
consisting of amateurs who perform admirably the best
orchestral works. The usual supper followed. After propitiating me
with a trio from 'Cosi Fan Tutte', they drew me to the piano.
• Anacreontic[n]: 4.13048797627
• amateur[n]: 4.60138704262
• admirably[r]: 3.65226351076
• orchestral[j]: 7.09262661606
• trio[n]: 5.60459207257
• piano[n]: 6.36957273307
Correct
33. Statistical relatedness // Example
MASONB-31, positive: In the evening we went to Rev. Baptist
Noel's chapel, where one is always sure of edification from the
sermon if not from the psalms.
psalm[n]: 4.05596201177
Wrong
34. Statistical relatedness // Example
MASONB-88, negative: Flags and pendants were suspended from the windows,
[...] the colours of the German States were waving harmoniously together, and
the banners of the Fine Arts, with appropriate inscriptions, particularly those of
music, poetry and painting, were especially honored, and floated triumphant
amidst the standards of electorates, dukedoms, and kingdoms.
harmoniously[r]:4.96754289705
music[n]:1.0
poetry[n]:5.93071678171
painting[n]:4.39244380382
triumphant[j]:3.80869437369
amidst[i]:3.6638322575
Wrong
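To make the three examples concrete, the sketch below classifies each excerpt from the term scores listed on the slides. The aggregation (a plain sum of term scores) and the threshold value are assumptions chosen for illustration only; they are not taken from the paper.

```python
# Term relatedness scores as listed on the slides.
SCORES = {
    "RECMUS-619": [4.13048797627, 4.60138704262, 3.65226351076,
                   7.09262661606, 5.60459207257, 6.36957273307],
    "MASONB-31": [4.05596201177],
    "MASONB-88": [4.96754289705, 1.0, 5.93071678171,
                  4.39244380382, 3.80869437369, 3.6638322575],
}

# Gold labels given on the slides (positive = a listening experience).
GOLD = {"RECMUS-619": True, "MASONB-31": True, "MASONB-88": False}

THRESHOLD = 10.0  # hypothetical cut-off, picked only for this example


def is_themed(scores, threshold=THRESHOLD):
    """Binary decision: is the summed relatedness at least the threshold?"""
    return sum(scores) >= threshold


for doc_id, scores in SCORES.items():
    predicted = is_themed(scores)
    outcome = "Correct" if predicted == GOLD[doc_id] else "Wrong"
    print(doc_id, round(sum(scores), 2), predicted, outcome)
```

Under these assumptions the sketch shows why a purely statistical score struggles: MASONB-31 contains a single related term and falls below the cut-off, while MASONB-88 accumulates many loosely related terms and passes it.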
35. 2> Themed entity detection
• DBpedia Spotlight to identify %entities%
• SPARQL query to filter the ones related to
cat:Music
• Where %entities% are the resources identified by
the NER engine, and %d% is a parameter, set to 5
(values above 5 introduce too much noise).
SELECT distinct ?sub WHERE {
VALUES ?sub { %entities% }
?sub dc:subject ?subject .
?subject skos:broader{0:%d%} cat:Music
}
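The filter expressed by the query can be sketched in plain Python as a bounded walk up the category hierarchy: an entity is kept if one of its categories reaches the target category in at most %d% `skos:broader` steps. The tiny hierarchy, the entity-to-category mapping, and the function names below are hypothetical stand-ins for what the SPARQL endpoint would return.

```python
from collections import deque

# Hypothetical excerpt of a category hierarchy: category -> broader categories.
BROADER = {
    "cat:Pianos": ["cat:Keyboard_instruments"],
    "cat:Keyboard_instruments": ["cat:Musical_instruments"],
    "cat:Musical_instruments": ["cat:Music"],
    "cat:Flags": ["cat:Symbols"],
}

# Hypothetical entity -> categories mapping (the dc:subject links).
SUBJECTS = {
    "dbpedia:Piano": ["cat:Pianos"],
    "dbpedia:Flag": ["cat:Flags"],
}


def reaches(category, target, max_depth):
    """True if target is at most max_depth broader-steps above category."""
    queue = deque([(category, 0)])
    seen = {category}
    while queue:
        current, depth = queue.popleft()
        if current == target:
            return True
        if depth == max_depth:
            continue  # do not expand beyond the allowed depth
        for parent in BROADER.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, depth + 1))
    return False


def themed_entities(entities, target="cat:Music", max_depth=5):
    """Keep the entities with at least one category that reaches the target."""
    return [e for e in entities
            if any(reaches(c, target, max_depth) for c in SUBJECTS.get(e, []))]


print(themed_entities(["dbpedia:Piano", "dbpedia:Flag"]))
```

Lowering `max_depth` makes the filter stricter and raising it admits more distant categories, which is the trade-off the slide describes for the depth parameter.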
36. 3> Hybridisation
Entity boost: promote terms mapped to entities
PoS filter: demote terms other than verbs and nouns, to privilege factual statements
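The two adjustments can be sketched as a simple re-weighting of each term's relatedness score. The multipliers, the function name `hybrid_score`, and the tag `v` for verbs are assumptions made for illustration; the slides do not state how the boost and the demotion are computed.

```python
ENTITY_BOOST = 2.0  # hypothetical multiplier for terms linked to a themed entity
POS_PENALTY = 0.5   # hypothetical multiplier for terms that are not nouns or verbs


def hybrid_score(relatedness, pos, is_themed_entity):
    """Re-weight a term's relatedness score with the two hybridisation rules."""
    score = relatedness
    if is_themed_entity:
        score *= ENTITY_BOOST  # entity boost
    if pos not in ("n", "v"):
        score *= POS_PENALTY   # PoS filter: demote non-nouns and non-verbs
    return score


# (term, PoS tag, relatedness from the slides, linked to a themed entity?)
terms = [
    ("piano", "n", 6.36957273307, True),
    ("amateur", "n", 4.60138704262, False),
    ("admirably", "r", 3.65226351076, False),
]

for term, pos, relatedness, linked in terms:
    print(term, hybrid_score(relatedness, pos, linked))
```

In this sketch a noun that is linked to a music-related entity, such as "piano", gains weight, while an adverb such as "admirably" loses weight even though its statistical relatedness is fairly high.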
37. Hybrid Approach // Example
RECMUS-619, positive: Introduced to the Anacreontic Society,
consisting of amateurs who perform admirably the best
orchestral works. The usual supper followed. After propitiating me
with a trio from 'Cosi Fan Tutte', they drew me to the piano.
http://dbpedia.org/resource/Anacreontic_Society
http://dbpedia.org/resource/Orchestra
http://dbpedia.org/resource/Trio_(music)
http://dbpedia.org/resource/Così_fan_tutte
http://dbpedia.org/resource/Piano
Correct
38. Hybrid Approach // Example
MASONB-31, positive: In the evening we went to Rev. Baptist
Noel's chapel, where one is always sure of edification from the
sermon if not from the psalms.
http://dbpedia.org/resource/Evening_Prayer_(Anglican)
http://dbpedia.org/resource/Psalms
Correct
39. Hybrid Approach // Example
MASONB-88, negative: Flags and pendants were suspended from the windows, [...]
the colours of the German States were waving harmoniously together, and the
banners of the Fine Arts, with appropriate inscriptions, particularly those of music,
poetry and painting, were especially honored, and floated triumphant amidst the
standards of electorates, dukedoms, and kingdoms.
http://dbpedia.org/resource/Music
Correct
41. What about supporting curation?
How to support users in cataloguing the documentary evidence?
How to detect the entities and their relationships in the sources?
How to automatically populate the database with metadata?
43. Knowledge Extraction (KE)
• Bet: metadata curation could be supported with KE methods
• KE: automatic or semi-automatic derivation of formal symbolic knowledge from
unstructured or semi-structured sources
• Approaches in the literature vary in task / scope:
• (Named) Entity Recognition and Classification (Person, Work, Time, Place,
…)
• Entity Linking (DBpedia, Gazetteers)
• Relation Extraction (listener of, in place)
• Event extraction (Performance)
• Machine reading
44. Example #1
"I then went to Amsterdam to conduct Oedipus at the
Concertgebouw, which was celebrating its fortieth
anniversary by a series of sumptuous musical
productions. The fine Concertgebouw orchestra, always
at the same high level, the magnificent male choruses
from the Royal Apollo Society, soloists of the first rank -
among them Mme Hélène Sadoven as Jocasta, Louis van
Tulder as Oedipus, and Paul Huf, an excellent reader -
and the way in which my work was received by the public,
have left a particularly precious memory that I recall with
much enjoyment."
listener: Igor Strawinsky
time: in the beginning of 1928
place: Amsterdam
opera: Oedipus Rex
/by: Igor Strawinsky
performer: Concertgebouw orch.
environment: Public
Igor Stravinsky
An Autobiography (1936), p. 139.
https://led.kmi.open.ac.uk/entity/lexp/1435674909834
45. Example #2
"Music is certainly a pleasure that may be
reckoned intellectual, and we shall never
again have it in the perfection it is this
year, because Mr. Handel will not
compose any more! Oratorios begin next
week, to my great joy, for they are the
highest entertainment to me."
listener: Mrs Delany
time: March, 1737
place: London
opera: Operas and Oratorios
/by: G. F. Handel
environment: Public
From: Mary Granville, and Augusta Hall (ed.),
Autobiography and Correspondence of Mary Granville, Mrs
Delany: with interesting Reminiscences of King George the
Third and Queen Charlotte, volume 1 (London, 1861), p.
594.
https://led.kmi.open.ac.uk/entity/lexp/1444424772006
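The metadata attached to the two examples can be pictured as a small record type that a knowledge extraction pipeline would have to fill in. The class below is only a sketch: the field names follow the labels on the slides, while the class name, the optional `performer` field, and the helper `missing_fields` are assumptions and do not reflect the actual LED data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ListeningExperience:
    """Target record for one excerpt; field names follow the slide labels."""
    listener: str
    time: str
    place: str
    opera: str
    by: str
    environment: str
    performer: Optional[str] = None  # not given for every example


example_1 = ListeningExperience(
    listener="Igor Strawinsky",
    time="in the beginning of 1928",
    place="Amsterdam",
    opera="Oedipus Rex",
    by="Igor Strawinsky",
    environment="Public",
    performer="Concertgebouw orch.",
)

example_2 = ListeningExperience(
    listener="Mrs Delany",
    time="March, 1737",
    place="London",
    opera="Operas and Oratorios",
    by="G. F. Handel",
    environment="Public",
)


def missing_fields(record):
    """Names of the fields that are still empty in a record."""
    return [name for name, value in vars(record).items() if value is None]


print(missing_fields(example_1))
print(missing_fields(example_2))
```

Several of these values (the listener, the date, the full title of the work) do not appear verbatim in the quoted excerpts, so filling the record requires more than reading the excerpt itself.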
Feedback: @enridaga | www.enridaga.net
46. Analysis: detect the Listener & Place of a LE
• Q1 - in the excerpt? The place is mentioned in the
excerpt in 25.9% of cases, the listener in only 13.4%.
• Q2 - near the excerpt? In only 10% of cases is the
place mentioned less than 5 paragraphs from the
excerpt; the agent, in 4% of cases.
• Q3 - in the source? In 83.2% of cases the place is
mentioned at least once in the source. In 11.4%
the place was not found.
• Q4 - in the metadata? 64.8% of the listeners are also
the authors of the text - 5874 cases in LED.
[Figure: distance of the entity from the excerpt, in number of paragraphs]
47. Open problems
• Implicit information, based on inference requiring expertise (e.g. Mr
Handel is G. F. Handel, Oedipus is “Oedipus Rex”)
• The role of contextual knowledge is fundamental in (1) identifying
the agent from the metadata of the source; (2) common-sense
inference (“in the beginning of 1928”)
• Entities can exist in distributed, heterogeneous resources
(encyclopaedic KBs, domain-specific taxonomies, gazetteers, …)
• Cultural studies typically coin novel concepts (ListeningExperience)
with original schemas. Portability of the methods is even more at risk!
Daga, E and Motta, E. "Challenging knowledge extraction to support the curation of documentary evidence in the humanities." (2019).
48. Summary
• Linked Data transforms the way information is shared on the Web
• but it also enables opportunities to apply AI techniques to more
application domains
• supporting users in finding and curating documentary evidence is an
important and difficult task
• finding complex concepts in texts is (more) possible than before,
although most of these techniques have not been applied at scale yet
• traditional AI research is challenged by the richness and diversity of use
cases in the humanities, especially with regard to knowledge extraction
49. WHiSe 3
Call for papers!
3rd Workshop on Humanities in
the Semantic Web (WHiSe)
Co-located with the 17th Extended
Semantic Web Conference (ESWC 2020)
Heraklion, Crete, Greece
31/05 or 31/06, 2020 (TBD)
Submission deadline:
28th February
http://whise.cc/
https://commons.wikimedia.org/wiki/File:Edward_Burne-Jones_-_Tile_Design_-_Theseus_and_the_Minotaur_in_the_Labyrinth_-_Google_Art_Project.jpg
50. PhD position open soon
Title: “Distributed Linked Data for Cultural Heritage”
The aim of this project is to research and develop distributed, Linked Data systems
that enable cultural content to be shared between museums and the public. This may
include innovative ways of publishing digital artworks and related resources by
memory institutions as well as enabling the public to share their own experiences of
visiting and engaging with cultural heritage. The PhD will benefit from being closely
connected with the EU funded SPICE project [1] which is developing methods and
tools to allow citizen groups to actively participate with museums internationally.
[1] http://kmi.open.ac.uk/projects/name/spice