SlideShare a Scribd company logo
1 of 26
Download to read offline
Capturing Themed Evidence, a Hybrid Approach
K-Cap 2019
19-21 November 2019
Marina del Rey, California, United States
Enrico Daga and Enrico Motta
The Open University
enrico.daga@open.ac.uk
Motivation
The task of identifying pieces of evidence in texts is of fundamental
importance in supporting qualitative studies in various domains,
especially in the humanities (e.g. historiographic methodology)
Problems: the activity (a) requires effort / time, (b) is not systematic, (c) is
prone to errors, and (d) the methodology is (often) not documented
Case study: the Listening Experience Database
• An open and freely searchable database that brings together a mass of
data about people’s experiences of listening to music of all kinds, in any
historical period and any culture.
• Sophisticated data model, natively in RDF / SPARQL
• Linked Open Data: http://data.open.ac.uk/context/led
• Since 2012, the LED project has collected over 10,000 unique listening
experiences from a variety of textual sources
https://led.kmi.open.ac.uk/
How to support users on capturing themed evidence?
• We coin the expression themed evidence, to refer to (direct or
indirect) traces of a fact or situation relevant to a theme of interest
and study the problem of identifying them in texts.
• The task of identifying themed evidence is at the intersection between
topical text classification (finding texts relevant to a certain theme) and
event retrieval (find events mentioned in texts).
• Not all topical texts are themed evidence and the nature of the event
itself is often assumed, implicit, and left to the reader
Finding Listening Experiences (theme: music)
• RECMUS-619, positive: Introduced to the Anacreontic Society, consisting of
amateurs who perform admirably the best orchestral works. The usual supper
followed. After propitiating me with a trio from ’Cosi Fan Tutte’, they drew me to
the piano.
• MASONB-31, positive: In the evening we went to Rev. Baptist Noel’s chapel,
where one is always sure of edification from the sermon if not from the psalms.
• MASONB-88, negative: Flags and pendants were suspended from the windows,
[. . . ] the colors of the German States were waving harmoniously together, and
the banners of the Fine Arts, with appropriate inscriptions, particularly those of
music, poetry and painting, were especially honored, and floated triumphant
amidst the standards of electorates, dukedoms, and kingdoms.
A Hybrid Approach
• Themed evidence are a subset of topical texts (e.g. about “music”) - distributional
semantics
• Common knowledge graphs include a large amounts of interlinked entities,
including topical entities (in the category “music”) - entity linking to structured
knowledge
• Background knowledge can be used for learning features and tuning elements
of the method - corpus based analysis
• We formalise the task as a binary classification problem; approach in three steps:
1. Statistical relatedness analysis
2. Themed-entity detection
3. Hybridisation phase
Background Knowledge // Listening Experiences
• LE Database includes text excerpts that can be analysed as positive examples.
• Project Gutenberg >58k books in the public domain (48790 en)
• Reuters-21578 (Reu) 21.578 news articles of various categories. It does not
include music.
• The UK Reading Experience Database (UK RED) investigates the evidence of
reading in Britain
• DBpedia is a large knowledge graph published as Linked Data. Includes SPARQL
endpoint and a NER tool: DBpedia Spotlight
1> Statistical Relatedness Analysis
• Compute embeddings (Word2Vec) on Project Gutenberg (1.5B words!),
we develop a domain dictionary of 10k terms related to a core term:
music[n] (1.0) <— core term
melody[n] (7.8010)
guitar[n] (6.8451)
inspiriting[j] (6.3402)
heartful[j] (4.2634)
psalm[n] (4.0559)
…
1> Statistical Relatedness Analysis
0 rontgen[N]
1 play[V]
2 Brahms[N]
3 symphony[N]
4 another[D]
5 musical[J]
6 take[V]
7 always[R]
8 happen[V]
9 specially[R]
10 count[V]
11 something[N]
12 sort[N]
1> Distribution Analysis // Learning the threshold
We analyse the distribution of the dictionary with 1+ (LED) and 2- corpora
(RED and Reu), and calculate both average score x and standard deviation σx
on the positive corpus.
These values partition the corpus in quartiles:
(1) r < (x−σx); (3) (x+σx) < r > (x);
(2) x < r > (x−σx); (4) r > (x + σx).
3 threshold values:
th1 > (x − σx),
th2 > x, and th3 > x + σx.
Scores
Items
1> Statistical relatedness // Example
RECMUS-619, positive: Introduced to the Anacreontic Society, consisting of
amateurs who perform admirably the best orchestral works. The usual
supper followed. After propitiating me with a trio from 'Cosi Fan Tutte', they
drew me to the piano.
• Anacreontic[n]: 4.13048797627
• amateur[n]: 4.60138704262
• admirably[r]: 3.65226351076
• orchestral[j]: 7.09262661606
• trio[n]: 5.60459207257
• piano[n]: 6.36957273307
1> Statistical relatedness // Example
MASONB-31, positive: In the evening we went to Rev. Baptist Noel's chapel,
where one is always sure of edification from the sermon if not from the
psalms.
psalm[n]: 4.05596201177
1> Statistical relatedness // Example
MASONB-88, negative: Flags and pendants were suspended from the windows, [...]
the colours of the German States were waving harmoniously together, and the
banners of the Fine Arts, with appropriate inscriptions, particularly those of music,
poetry and painting, were especially honored, and ︎oated triumphant amidst the
standards of electorates, dukedoms, and kingdoms.
harmoniously[r]:4.96754289705
music[n]:1.0
poetry[n]:5.93071678171
painting[n]:4.39244380382
triumphant[j]:3.80869437369
amidst[i]:3.6638322575
Problems
• Named entities may not be sufficiently represented in the dictionary
(e.g."Prélude à l'après-midi d'un faune").
• Many entities may not appear in the trained embeddings.
• Terms may have a low score because not statistically relevant.
• However, the presence of named entities is a clue of possible evidence.
• Distributional approaches alone inherit ambiguity of the core term, for
example, figurative use (sounds good?)
2> Themed entity detection
• DBPedia Spotlight to identify %entities%
• SPARQL query to filter the ones related to
dbcat:Music
• Where %entities% are the resources identified by
the NER engine, and %d% is a parameter, set to 5
(>5 too much noise).
SELECT distinct ?sub WHERE {
VALUES ?sub { %entities% }
?sub dc:subject ?subject .
?subject skos:broader{0:%d%} cat:Music
}
2> Themed-entity detection // Example
RECMUS-619, positive: Introduced to the Anacreontic Society, consisting of
amateurs who perform admirably the best orchestral works. The usual
supper followed. After propitiating me with a trio from 'Cosi Fan Tutte', they
drew me to the piano.
http://dbpedia.org/resource/Anacreontic_Society
http://dbpedia.org/resource/Orchestra
http://dbpedia.org/resource/Trio_(music)
http://dbpedia.org/resource/Così_fan_tutte
http://dbpedia.org/resource/Piano
2> Themed-entity detection // Example
MASONB-31, positive: In the evening we went to Rev. Baptist Noel's chapel,
where one is always sure of edification from the sermon if not from the
psalms.
http://dbpedia.org/resource/Evening_Prayer_(Anglican)
http://dbpedia.org/resource/Psalms
2> Themed-entity detection // Example
MASONB-88, negative: Flags and pendants were suspended from the windows, [...]
the colours of the German States were waving harmoniously together, and the
banners of the Fine Arts, with appropriate inscriptions, particularly those of music,
poetry and painting, were especially honored, and ︎oated triumphant amidst the
standards of electorates, dukedoms, and kingdoms.
http://dbpedia.org/resource/Music
3> Hybridisation
Entity boost. To
promote terms mapped
to entities
PoS Filter: demote
terms other then verbs
and nouns, to privilege
factual statements
3> Hybridisation // Examples
• RECMUS-619: Introduced to the Anacreontic Society, consisting of
amateurs who perform admirably the best orchestral works. The usual
supper followed. After propitiating me with a trio from 'Cosi Fan Tutte',
they drew me to the piano
• MASONB-31: In the evening we went to Rev. Baptist Noel's chapel, where
one is always sure of edification from the sermon if not from the psalms.
• MASONB-88: Flags and pendants were suspended from the windows, [...]
the colours of the German States were waving harmoniously together, and
the banners of the Fine Arts, with appropriate inscriptions, particularly
those of music, poetry and painting, were especially honored, and ︎oated
triumphant amidst the standards of electorates, dukedoms, and kingdoms.
http://led.kmi.open.ac.uk/discovery/findler
Evaluation // Gold Standard
• 500 positive samples sourced from 17
books in the LED collection
• 500 negative samples sourced from the
same books
• Negative samples picked with similar length
of each positive (avg length ~125 words)
• Accurate: Fleiss’ kappa reports substantial
agreement among annotators
• Pessimistic: negative samples more similar
to positives then to RED and Reu
• Also a gold standard produced from RED, to
evaluate portability
Both GS published openly for reuse
http://led.kmi.open.ac.uk/discovery/findler
Evaluation // Methods
• Fo: Random Forest Classifier (ML) // trained on LED, RED, and Reu // Test ~80% Acc
• St: Statistical // a dictionary from Gutenberg’s Music shelf // AVG TF/IDF
(more details in the paper)
• Em: Statistical relatedness component only (Embeddings)
• En: Themed entity detection component (Entity)
• Em+F: Statistical relatedness + PoS Filter (Embeddings - Filtered)
• Hy-F: No filter, only entity boost (Hybrid - Unfiltered)
• Hy: Our Hybrid approach
• Hy/R: Our Hybrid approach on the Reading Experience Database (to test
portability). Core concept: book[n] and core entity: dbc:Literature
http://led.kmi.open.ac.uk/discovery/findler
Evaluation // Discussion
Fo: high precision, low recall, accuracy slightly
above random (robust GS)
En alone has a performance slightly above
random: gold standard is pessimistic
Without applying noise correction (POS filter),
precision is generally lower
Hy-F shows the impact of entity detection on
recall
Hy: best of both worlds. Substantial agreement
with annotators (Cohen’s K)
Hy/R: our approach is applicable to other
domains with small configuration
See the paper for more observations and for an analysis
of errors
The results are very good: 87% F-Measure & Accuracy
http://led.kmi.open.ac.uk/discovery/findler
Future work
• Applying the method to the scan of books (FindLEr demo) involves other issues before
classification, incl. segmentation, and clutter (indexes, references, …)
• In absence of a knowledge base of annotated documents, how to learn the parameters -
threshold & default score?
• Experiment with other embeddings techniques (ElMo, BERT), extract multi-words
expressions, and try other entity linkers (Wikifier/Wikidata)
• We performed a concept search, what about a multi-concept search (music & children,
music & war)
• Searching repositories instead of books (some workflow issues here…)
• KE to support the curation of the documentary evidence. See the Sciknow position
paper “Challenging knowledge extraction to support the curation of documentary evidence in
the humanities”
http://led.kmi.open.ac.uk/discovery/findler
Questions?
Feedback:	@enridaga	|	www.enridaga.net

More Related Content

Similar to Capturing Themed Evidence, a Hybrid Approach

Information Extraction on Noisy Texts for Historical Research
Information Extraction on Noisy Texts for Historical ResearchInformation Extraction on Noisy Texts for Historical Research
Information Extraction on Noisy Texts for Historical ResearchKepa J. Rodriguez
 
"It will discourse most eloquent music": Sonify Variants of Hamlet
"It will discourse most eloquent music": Sonify Variants of Hamlet"It will discourse most eloquent music": Sonify Variants of Hamlet
"It will discourse most eloquent music": Sonify Variants of HamletIain Emsley
 
Caroline Ardrey (University of Birmingham)
Caroline Ardrey (University of Birmingham)Caroline Ardrey (University of Birmingham)
Caroline Ardrey (University of Birmingham)Renata Brandão
 
Denktank 2010
Denktank 2010Denktank 2010
Denktank 2010ocor203
 
Genre Classification and Analysis
Genre Classification and AnalysisGenre Classification and Analysis
Genre Classification and AnalysisAnat Gilboa
 
DPLA Archival Description Working Group Update
DPLA Archival Description Working Group UpdateDPLA Archival Description Working Group Update
DPLA Archival Description Working Group UpdateGretchen Gueguen
 
Russianmusicgenre
RussianmusicgenreRussianmusicgenre
Russianmusicgenrepengel1
 
BMus Music History in Context 4b info skills (refine) 2018/19
BMus Music History in Context 4b info skills (refine) 2018/19BMus Music History in Context 4b info skills (refine) 2018/19
BMus Music History in Context 4b info skills (refine) 2018/19Jerwood Library, Trinity Laban
 
Making the Leap Towards Linked Data
Making the Leap Towards Linked DataMaking the Leap Towards Linked Data
Making the Leap Towards Linked DataIris Lee
 
Discovering music: small-scale, web-scale, facets, and beyond-Belford
Discovering music: small-scale, web-scale, facets, and beyond-BelfordDiscovering music: small-scale, web-scale, facets, and beyond-Belford
Discovering music: small-scale, web-scale, facets, and beyond-BelfordNASIG
 
CENDARI Summer School July 2015 Burrows
CENDARI Summer School July 2015 BurrowsCENDARI Summer School July 2015 Burrows
CENDARI Summer School July 2015 BurrowsToby Burrows
 
Creating an Entertaining and Informative Music Visualization
Creating an Entertaining and Informative Music VisualizationCreating an Entertaining and Informative Music Visualization
Creating an Entertaining and Informative Music Visualizationicchp2012
 
Malina lafayette 2019 colloquium
Malina lafayette 2019 colloquiumMalina lafayette 2019 colloquium
Malina lafayette 2019 colloquiumroger malina
 
Lovello Concert 8-2.pub (Read-Only)
 Lovello Concert 8-2.pub (Read-Only) Lovello Concert 8-2.pub (Read-Only)
Lovello Concert 8-2.pub (Read-Only)crysatal16
 
Datech2014 Session 2 - Automated Assignment of Topics to OCRed Texts
Datech2014 Session 2 - Automated Assignment of Topics to OCRed TextsDatech2014 Session 2 - Automated Assignment of Topics to OCRed Texts
Datech2014 Session 2 - Automated Assignment of Topics to OCRed TextsIMPACT Centre of Competence
 

Similar to Capturing Themed Evidence, a Hybrid Approach (20)

Information Extraction on Noisy Texts for Historical Research
Information Extraction on Noisy Texts for Historical ResearchInformation Extraction on Noisy Texts for Historical Research
Information Extraction on Noisy Texts for Historical Research
 
"It will discourse most eloquent music": Sonify Variants of Hamlet
"It will discourse most eloquent music": Sonify Variants of Hamlet"It will discourse most eloquent music": Sonify Variants of Hamlet
"It will discourse most eloquent music": Sonify Variants of Hamlet
 
Caroline Ardrey (University of Birmingham)
Caroline Ardrey (University of Birmingham)Caroline Ardrey (University of Birmingham)
Caroline Ardrey (University of Birmingham)
 
Denktank 2010
Denktank 2010Denktank 2010
Denktank 2010
 
Genre Classification and Analysis
Genre Classification and AnalysisGenre Classification and Analysis
Genre Classification and Analysis
 
DPLA Archival Description Working Group Update
DPLA Archival Description Working Group UpdateDPLA Archival Description Working Group Update
DPLA Archival Description Working Group Update
 
MIR
MIRMIR
MIR
 
Russianmusicgenre
RussianmusicgenreRussianmusicgenre
Russianmusicgenre
 
BMus Music History in Context 4b info skills (refine) 2018/19
BMus Music History in Context 4b info skills (refine) 2018/19BMus Music History in Context 4b info skills (refine) 2018/19
BMus Music History in Context 4b info skills (refine) 2018/19
 
Making the Leap Towards Linked Data
Making the Leap Towards Linked DataMaking the Leap Towards Linked Data
Making the Leap Towards Linked Data
 
Discovering music: small-scale, web-scale, facets, and beyond-Belford
Discovering music: small-scale, web-scale, facets, and beyond-BelfordDiscovering music: small-scale, web-scale, facets, and beyond-Belford
Discovering music: small-scale, web-scale, facets, and beyond-Belford
 
CENDARI Summer School July 2015 Burrows
CENDARI Summer School July 2015 BurrowsCENDARI Summer School July 2015 Burrows
CENDARI Summer School July 2015 Burrows
 
Creating an Entertaining and Informative Music Visualization
Creating an Entertaining and Informative Music VisualizationCreating an Entertaining and Informative Music Visualization
Creating an Entertaining and Informative Music Visualization
 
Malina lafayette 2019 colloquium
Malina lafayette 2019 colloquiumMalina lafayette 2019 colloquium
Malina lafayette 2019 colloquium
 
Lovello Concert 8-2.pub (Read-Only)
 Lovello Concert 8-2.pub (Read-Only) Lovello Concert 8-2.pub (Read-Only)
Lovello Concert 8-2.pub (Read-Only)
 
Session5 01.rutger vankoert
Session5 01.rutger vankoertSession5 01.rutger vankoert
Session5 01.rutger vankoert
 
Art resources
Art resourcesArt resources
Art resources
 
Datech2014 Session 2 - Automated Assignment of Topics to OCRed Texts
Datech2014 Session 2 - Automated Assignment of Topics to OCRed TextsDatech2014 Session 2 - Automated Assignment of Topics to OCRed Texts
Datech2014 Session 2 - Automated Assignment of Topics to OCRed Texts
 
On the two sides of the pond
On the two sides of the pondOn the two sides of the pond
On the two sides of the pond
 
Sparse and Low Rank Representations in Music Signal Analysis
 Sparse and Low Rank Representations in Music Signal  Analysis Sparse and Low Rank Representations in Music Signal  Analysis
Sparse and Low Rank Representations in Music Signal Analysis
 

More from Enrico Daga

Citizen Experiences in Cultural Heritage Archives: a Data Journey
Citizen Experiences in Cultural Heritage Archives: a Data JourneyCitizen Experiences in Cultural Heritage Archives: a Data Journey
Citizen Experiences in Cultural Heritage Archives: a Data JourneyEnrico Daga
 
Streamlining Knowledge Graph Construction with a façade: the SPARQL Anything...
Streamlining Knowledge Graph Construction with a façade:  the SPARQL Anything...Streamlining Knowledge Graph Construction with a façade:  the SPARQL Anything...
Streamlining Knowledge Graph Construction with a façade: the SPARQL Anything...Enrico Daga
 
Data integration with a façade. The case of knowledge graph construction.
Data integration with a façade. The case of knowledge graph construction.Data integration with a façade. The case of knowledge graph construction.
Data integration with a façade. The case of knowledge graph construction.Enrico Daga
 
Knowledge graph construction with a façade - The SPARQL Anything Project
Knowledge graph construction with a façade - The SPARQL Anything ProjectKnowledge graph construction with a façade - The SPARQL Anything Project
Knowledge graph construction with a façade - The SPARQL Anything ProjectEnrico Daga
 
Trying SPARQL Anything with MEI
Trying SPARQL Anything with MEITrying SPARQL Anything with MEI
Trying SPARQL Anything with MEIEnrico Daga
 
The SPARQL Anything project
The SPARQL Anything projectThe SPARQL Anything project
The SPARQL Anything projectEnrico Daga
 
Towards a Smart (City) Data Science. A case-based retrospective on policies, ...
Towards a Smart (City) Data Science. A case-based retrospective on policies, ...Towards a Smart (City) Data Science. A case-based retrospective on policies, ...
Towards a Smart (City) Data Science. A case-based retrospective on policies, ...Enrico Daga
 
Challenging knowledge extraction to support
the curation of documentary evide...
Challenging knowledge extraction to support
the curation of documentary evide...Challenging knowledge extraction to support
the curation of documentary evide...
Challenging knowledge extraction to support
the curation of documentary evide...Enrico Daga
 
OU RSE Tutorial Big Data Cluster
OU RSE Tutorial Big Data ClusterOU RSE Tutorial Big Data Cluster
OU RSE Tutorial Big Data ClusterEnrico Daga
 
CityLABS Workshop: Working with large tables
CityLABS Workshop: Working with large tablesCityLABS Workshop: Working with large tables
CityLABS Workshop: Working with large tablesEnrico Daga
 
Propagating Data Policies - A User Study
Propagating Data Policies - A User StudyPropagating Data Policies - A User Study
Propagating Data Policies - A User StudyEnrico Daga
 
Linked Data at the OU - the story so far
Linked Data at the OU - the story so farLinked Data at the OU - the story so far
Linked Data at the OU - the story so farEnrico Daga
 
Propagation of Policies in Rich Data Flows
Propagation of Policies in Rich Data FlowsPropagation of Policies in Rich Data Flows
Propagation of Policies in Rich Data FlowsEnrico Daga
 
A bottom up approach for licences classification and selection
A bottom up approach for licences classification and selectionA bottom up approach for licences classification and selection
A bottom up approach for licences classification and selectionEnrico Daga
 
A BASILar Approach for Building Web APIs on top of SPARQL Endpoints
A BASILar Approach for Building Web APIs on top of SPARQL EndpointsA BASILar Approach for Building Web APIs on top of SPARQL Endpoints
A BASILar Approach for Building Web APIs on top of SPARQL EndpointsEnrico Daga
 
Early Analysis and Debuggin of Linked Open Data Cubes
Early Analysis and Debuggin of Linked Open Data CubesEarly Analysis and Debuggin of Linked Open Data Cubes
Early Analysis and Debuggin of Linked Open Data CubesEnrico Daga
 

More from Enrico Daga (17)

Citizen Experiences in Cultural Heritage Archives: a Data Journey
Citizen Experiences in Cultural Heritage Archives: a Data JourneyCitizen Experiences in Cultural Heritage Archives: a Data Journey
Citizen Experiences in Cultural Heritage Archives: a Data Journey
 
Streamlining Knowledge Graph Construction with a façade: the SPARQL Anything...
Streamlining Knowledge Graph Construction with a façade:  the SPARQL Anything...Streamlining Knowledge Graph Construction with a façade:  the SPARQL Anything...
Streamlining Knowledge Graph Construction with a façade: the SPARQL Anything...
 
Data integration with a façade. The case of knowledge graph construction.
Data integration with a façade. The case of knowledge graph construction.Data integration with a façade. The case of knowledge graph construction.
Data integration with a façade. The case of knowledge graph construction.
 
Knowledge graph construction with a façade - The SPARQL Anything Project
Knowledge graph construction with a façade - The SPARQL Anything ProjectKnowledge graph construction with a façade - The SPARQL Anything Project
Knowledge graph construction with a façade - The SPARQL Anything Project
 
Trying SPARQL Anything with MEI
Trying SPARQL Anything with MEITrying SPARQL Anything with MEI
Trying SPARQL Anything with MEI
 
The SPARQL Anything project
The SPARQL Anything projectThe SPARQL Anything project
The SPARQL Anything project
 
Towards a Smart (City) Data Science. A case-based retrospective on policies, ...
Towards a Smart (City) Data Science. A case-based retrospective on policies, ...Towards a Smart (City) Data Science. A case-based retrospective on policies, ...
Towards a Smart (City) Data Science. A case-based retrospective on policies, ...
 
Challenging knowledge extraction to support
the curation of documentary evide...
Challenging knowledge extraction to support
the curation of documentary evide...Challenging knowledge extraction to support
the curation of documentary evide...
Challenging knowledge extraction to support
the curation of documentary evide...
 
Ld4 dh tutorial
Ld4 dh tutorialLd4 dh tutorial
Ld4 dh tutorial
 
OU RSE Tutorial Big Data Cluster
OU RSE Tutorial Big Data ClusterOU RSE Tutorial Big Data Cluster
OU RSE Tutorial Big Data Cluster
 
CityLABS Workshop: Working with large tables
CityLABS Workshop: Working with large tablesCityLABS Workshop: Working with large tables
CityLABS Workshop: Working with large tables
 
Propagating Data Policies - A User Study
Propagating Data Policies - A User StudyPropagating Data Policies - A User Study
Propagating Data Policies - A User Study
 
Linked Data at the OU - the story so far
Linked Data at the OU - the story so farLinked Data at the OU - the story so far
Linked Data at the OU - the story so far
 
Propagation of Policies in Rich Data Flows
Propagation of Policies in Rich Data FlowsPropagation of Policies in Rich Data Flows
Propagation of Policies in Rich Data Flows
 
A bottom up approach for licences classification and selection
A bottom up approach for licences classification and selectionA bottom up approach for licences classification and selection
A bottom up approach for licences classification and selection
 
A BASILar Approach for Building Web APIs on top of SPARQL Endpoints
A BASILar Approach for Building Web APIs on top of SPARQL EndpointsA BASILar Approach for Building Web APIs on top of SPARQL Endpoints
A BASILar Approach for Building Web APIs on top of SPARQL Endpoints
 
Early Analysis and Debuggin of Linked Open Data Cubes
Early Analysis and Debuggin of Linked Open Data CubesEarly Analysis and Debuggin of Linked Open Data Cubes
Early Analysis and Debuggin of Linked Open Data Cubes
 

Recently uploaded

Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 

Recently uploaded (20)

CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 

Capturing Themed Evidence, a Hybrid Approach

  • 1. Capturing Themed Evidence, a Hybrid Approach K-Cap 2019 19-21 November 2019 Marina del Rey, California, United States Enrico Daga and Enrico Motta The Open University enrico.daga@open.ac.uk
  • 2. Motivation The task of identifying pieces of evidence in texts is of fundamental importance in supporting qualitative studies in various domains, especially in the humanities (e.g. historiographic methodology) Problems: the activity (a) requires effort / time, (b) is not systematic, (c) is prone to errors, and (d) the methodology is (often) not documented
  • 3. Case study: the Listening Experience Database • An open and freely searchable database that brings together a mass of data about people’s experiences of listening to music of all kinds, in any historical period and any culture. • Sophisticated data model, natively in RDF / SPARQL • Linked Open Data: http://data.open.ac.uk/context/led • Since 2012, the LED project has collected over 10,000 unique listening experiences from a variety of textual sources https://led.kmi.open.ac.uk/
  • 4. How to support users on capturing themed evidence? • We coin the expression themed evidence, to refer to (direct or indirect) traces of a fact or situation relevant to a theme of interest and study the problem of identifying them in texts. • The task of identifying themed evidence is at the intersection between topical text classification (finding texts relevant to a certain theme) and event retrieval (find events mentioned in texts). • Not all topical texts are themed evidence and the nature of the event itself is often assumed, implicit, and left to the reader
  • 5. Finding Listening Experiences (theme: music) • RECMUS-619, positive: Introduced to the Anacreontic Society, consisting of amateurs who perform admirably the best orchestral works. The usual supper followed. After propitiating me with a trio from ’Cosi Fan Tutte’, they drew me to the piano. • MASONB-31, positive: In the evening we went to Rev. Baptist Noel’s chapel, where one is always sure of edification from the sermon if not from the psalms. • MASONB-88, negative: Flags and pendants were suspended from the windows, [. . . ] the colors of the German States were waving harmoniously together, and the banners of the Fine Arts, with appropriate inscriptions, particularly those of music, poetry and painting, were especially honored, and floated triumphant amidst the standards of electorates, dukedoms, and kingdoms.
  • 6. A Hybrid Approach • Themed evidence are a subset of topical texts (e.g. about “music”) - distributional semantics • Common knowledge graphs include a large amounts of interlinked entities, including topical entities (in the category “music”) - entity linking to structured knowledge • Background knowledge can be used for learning features and tuning elements of the method - corpus based analysis • We formalise the task as a binary classification problem; approach in three steps: 1. Statistical relatedness analysis 2. Themed-entity detection 3. Hybridisation phase
  • 7. Background Knowledge // Listening Experiences • LE Database includes text excerpts that can be analysed as positive examples. • Project Gutenberg >58k books in the public domain (48790 en) • Reuters-21578 (Reu) 21.578 news articles of various categories. It does not include music. • The UK Reading Experience Database (UK RED) investigates the evidence of reading in Britain • DBpedia is a large knowledge graph published as Linked Data. Includes SPARQL endpoint and a NER tool: DBpedia Spotlight
  • 8. 1> Statistical Relatedness Analysis • Compute embeddings (Word2Vec) on Project Gutenberg (1.5B words!), we develop a domain dictionary of 10k terms related to a core term: music[n] (1.0) <— core term melody[n] (7.8010) guitar[n] (6.8451) inspiriting[j] (6.3402) heartful[j] (4.2634) psalm[n] (4.0559) …
  • 9. 1> Statistical Relatedness Analysis 0 rontgen[N] 1 play[V] 2 Brahms[N] 3 symphony[N] 4 another[D] 5 musical[J] 6 take[V] 7 always[R] 8 happen[V] 9 specially[R] 10 count[V] 11 something[N] 12 sort[N]
  • 10. 1> Distribution Analysis // Learning the threshold We analyse the distribution of the dictionary with 1+ (LED) and 2- corpora (RED and Reu), and calculate both average score x and standard deviation σx on the positive corpus. These values partition the corpus in quartiles: (1) r < (x−σx); (3) (x+σx) < r > (x); (2) x < r > (x−σx); (4) r > (x + σx). 3 threshold values: th1 > (x − σx), th2 > x, and th3 > x + σx. Scores Items
  • 11. 1> Statistical relatedness // Example RECMUS-619, positive: Introduced to the Anacreontic Society, consisting of amateurs who perform admirably the best orchestral works. The usual supper followed. After propitiating me with a trio from 'Cosi Fan Tutte', they drew me to the piano. • Anacreontic[n]: 4.13048797627 • amateur[n]: 4.60138704262 • admirably[r]: 3.65226351076 • orchestral[j]: 7.09262661606 • trio[n]: 5.60459207257 • piano[n]: 6.36957273307
  • 12. 1> Statistical relatedness // Example MASONB-31, positive: In the evening we went to Rev. Baptist Noel's chapel, where one is always sure of edification from the sermon if not from the psalms. psalm[n]: 4.05596201177
  • 13. 1> Statistical relatedness // Example MASONB-88, negative: Flags and pendants were suspended from the windows, [...] the colours of the German States were waving harmoniously together, and the banners of the Fine Arts, with appropriate inscriptions, particularly those of music, poetry and painting, were especially honored, and ︎oated triumphant amidst the standards of electorates, dukedoms, and kingdoms. harmoniously[r]:4.96754289705 music[n]:1.0 poetry[n]:5.93071678171 painting[n]:4.39244380382 triumphant[j]:3.80869437369 amidst[i]:3.6638322575
  • 14. Problems • Named entities may not be sufficiently represented in the dictionary (e.g."Prélude à l'après-midi d'un faune"). • Many entities may not appear in the trained embeddings. • Terms may have a low score because not statistically relevant. • However, the presence of named entities is a clue of possible evidence. • Distributional approaches alone inherit ambiguity of the core term, for example, figurative use (sounds good?)
  • 15. 2> Themed entity detection • DBPedia Spotlight to identify %entities% • SPARQL query to filter the ones related to dbcat:Music • Where %entities% are the resources identified by the NER engine, and %d% is a parameter, set to 5 (>5 too much noise). SELECT distinct ?sub WHERE { VALUES ?sub { %entities% } ?sub dc:subject ?subject . ?subject skos:broader{0:%d%} cat:Music }
  • 16. 2> Themed-entity detection // Example RECMUS-619, positive: Introduced to the Anacreontic Society, consisting of amateurs who perform admirably the best orchestral works. The usual supper followed. After propitiating me with a trio from 'Cosi Fan Tutte', they drew me to the piano. http://dbpedia.org/resource/Anacreontic_Society http://dbpedia.org/resource/Orchestra http://dbpedia.org/resource/Trio_(music) http://dbpedia.org/resource/Così_fan_tutte http://dbpedia.org/resource/Piano
  • 17. 2> Themed-entity detection // Example MASONB-31, positive: In the evening we went to Rev. Baptist Noel's chapel, where one is always sure of edification from the sermon if not from the psalms. http://dbpedia.org/resource/Evening_Prayer_(Anglican) http://dbpedia.org/resource/Psalms
  • 18. 2> Themed-entity detection // Example MASONB-88, negative: Flags and pendants were suspended from the windows, [...] the colours of the German States were waving harmoniously together, and the banners of the Fine Arts, with appropriate inscriptions, particularly those of music, poetry and painting, were especially honored, and ︎oated triumphant amidst the standards of electorates, dukedoms, and kingdoms. http://dbpedia.org/resource/Music
  • 19. 3> Hybridisation Entity boost. To promote terms mapped to entities PoS Filter: demote terms other then verbs and nouns, to privilege factual statements
  • 20. 3> Hybridisation // Examples • RECMUS-619: Introduced to the Anacreontic Society, consisting of amateurs who perform admirably the best orchestral works. The usual supper followed. After propitiating me with a trio from 'Cosi Fan Tutte', they drew me to the piano • MASONB-31: In the evening we went to Rev. Baptist Noel's chapel, where one is always sure of edification from the sermon if not from the psalms. • MASONB-88: Flags and pendants were suspended from the windows, [...] the colours of the German States were waving harmoniously together, and the banners of the Fine Arts, with appropriate inscriptions, particularly those of music, poetry and painting, were especially honored, and ︎oated triumphant amidst the standards of electorates, dukedoms, and kingdoms.
  • 22. Evaluation // Gold Standard • 500 positive samples sourced from 17 books in the LED collection • 500 negative samples sourced from the same books • Negative samples picked with similar length of each positive (avg length ~125 words) • Accurate: Fleiss’ kappa reports substantial agreement among annotators • Pessimistic: negative samples more similar to positives then to RED and Reu • Also a gold standard produced from RED, to evaluate portability Both GS published openly for reuse http://led.kmi.open.ac.uk/discovery/findler
  • 23. Evaluation // Methods • Fo: Random Forest Classifier (ML) // trained on LED, RED, and Reu // Test ~80% Acc • St: Statistical // a dictionary from Gutenberg’s Music shelf // AVG TF/IDF (more details in the paper) • Em: Statistical relatedness component only (Embeddings) • En: Themed entity detection component (Entity) • Em+F: Statistical relatedness + PoS Filter (Embeddings - Filtered) • Hy-F: No filter, only entity boost (Hybrid - Unfiltered) • Hy: Our Hybrid approach • Hy/R: Our Hybrid approach on the Reading Experience Database (to test portability). Core concept: book[n] and core entity: dbc:Literature http://led.kmi.open.ac.uk/discovery/findler
  • 24. Evaluation // Discussion Fo: high precision, low recall, accuracy slightly above random (robust GS) En alone has a performance slightly above random: gold standard is pessimistic Without applying noise correction (POS filter), precision is generally lower Hy-F shows the impact of entity detection on recall Hy: best of both worlds. Substantial agreement with annotators (Cohen’s K) Hy/R: our approach is applicable to other domains with small configuration See the paper for more observations and for an analysis of errors The results are very good: 87% F-Measure & Accuracy http://led.kmi.open.ac.uk/discovery/findler
  • 25. Future work • Applying the method to the scan of books (FindLEr demo) involves other issues before classification, incl. segmentation, and clutter (indexes, references, …) • In absence of a knowledge base of annotated documents, how to learn the parameters - threshold & default score? • Experiment with other embeddings techniques (ElMo, BERT), extract multi-words expressions, and try other entity linkers (Wikifier/Wikidata) • We performed a concept search, what about a multi-concept search (music & children, music & war) • Searching repositories instead of books (some workflow issues here…) • KE to support the curation of the documentary evidence. See the Sciknow position paper “Challenging knowledge extraction to support the curation of documentary evidence in the humanities” http://led.kmi.open.ac.uk/discovery/findler