Linking historical ship records to 
a newspaper archive 
Andrea Bravo Balado 
Victor de Boer, Guus Schreiber 
VU University Amsterdam
Context: dutchshipsandsailors.nl/ 
2
Dutch Ships and Sailors (DSS) datasets 
3
Results published as Linked Data 
4
Data visualizations 
5
This study 
• An increasing number of historical databases are 
being digitized 
• Finding matching occurrences of the same 
object in different datasets is both relevant 
(for historical research) and non-trivial 
– “Instance mapping” 
• This paper: case study of linking ship instances 
in two maritime datasets 
6
Focus on methodology 
• This study is not about developing new 
techniques 
• This study is about methodology: 
– What combination of existing techniques gets the 
“best” result? 
– What the “best” result is depends on context (i.e., 
goal of the historical research) 
• This is a case study, so be wary of 
generalization 
7
Data 
• Muster rolls (Northern Dutch Maritime 
Museum) 
– Period: 1803-1937 
– 77,043 records of 34,552 seamen 
– 17,098 mentions of 4,935 ships 
• Newspaper archive (Dutch National Library) 
– Period: 1618-1995 
– 7K newspapers, 9M pages (coverage: 10%) 
– Text generated via OCR 
8
Timeline newspapers in the archive 
9
Example muster roll record (in Dutch) 
10
Example newspaper article (in Dutch) 
11
Approach 
• Generate candidate set of links 
• Apply two types of filters to the candidate set 
– Domain-specific filtering 
• Using domain heuristics about ship identification 
– Text classification of newspaper articles 
• Determine whether the article is about a ship 
• Combine filters 
12
Baseline generation 
• Find all ship instances in the muster rolls 
• Query the newspaper archive for the first 100 hits 
on each ship's name 
– API: http://www.delpher.nl/ 
• Result set is expected to have high recall but 
low precision 
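The baseline step can be sketched as a loop over ship instances. This is a minimal illustration only: the `search` callable, the `(ship_id, name)` tuples and the toy archive are assumptions, and the archive's real API is not shown in the slides.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical search function: takes a query string and a hit limit and
# returns article identifiers. It stands in for the newspaper archive API.
SearchFn = Callable[[str, int], List[str]]


def generate_candidates(ships: Iterable[Tuple[str, str]],
                        search: SearchFn,
                        max_hits: int = 100) -> List[Tuple[str, str]]:
    """Return (ship_id, article_id) candidate links, one per search hit."""
    candidates = []
    for ship_id, name in ships:
        # Keep at most max_hits hits per ship name.
        for article_id in search(name, max_hits)[:max_hits]:
            candidates.append((ship_id, article_id))
    return candidates


# Toy in-memory archive, for illustration only.
TOY_ARCHIVE = {
    "a1": "De Hoop, kapitein Jansen, vertrokken naar Riga",
    "a2": "Er is nog hoop voor de oogst",
    "a3": "De Vrouw Maria is binnengekomen",
}


def toy_search(query: str, limit: int) -> List[str]:
    """Case-insensitive substring search over the toy archive."""
    hits = [aid for aid, text in TOY_ARCHIVE.items()
            if query.lower() in text.lower()]
    return hits[:limit]


links = generate_candidates([("s1", "Hoop"), ("s2", "Vrouw Maria")], toy_search)
print(links)
```

In the toy data, article `a2` uses "hoop" as an ordinary word ("hope"), so it becomes a false candidate for the ship of that name. This is the kind of noise that keeps baseline precision low.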
13
Evaluation 
• No gold standard 
• Manual assessment of all links is infeasible 
• Sampling method for evaluating candidates 
– 50 candidates per technique 
– 3 assessors (domain expert plus two authors) 
– Inter-observer agreement: Cohen’s kappa = 0.65 
• Recall: approximation, based on the estimated 
number of correct links (using the baseline) 
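The evaluation quantities can be sketched as follows. Cohen's kappa is defined for two raters, and the slides do not say how the three assessors' judgements were combined, so the sketch shows the two-rater case. The recall approximation is one plausible reading of "estimated number of correct links (using the baseline)", not necessarily the authors' exact formula, and all numbers are made up for illustration.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa for two raters who labelled the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    labels = count_a.keys() | count_b.keys()
    # Chance agreement: product of the raters' marginal label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one and the same label throughout
    return (observed - expected) / (1 - expected)


def sample_precision(judgements: Sequence[int]) -> float:
    """Fraction of sampled candidate links judged correct (1 = correct)."""
    return sum(judgements) / len(judgements)


def approximate_recall(filter_precision: float, filter_size: int,
                       baseline_precision: float, baseline_size: int) -> float:
    """Estimated correct links kept by a filter, divided by the estimated
    number of correct links in the baseline candidate set (assumed reading)."""
    estimated_correct_in_baseline = baseline_precision * baseline_size
    return (filter_precision * filter_size) / estimated_correct_in_baseline


# Illustrative judgements by two hypothetical assessors on ten links.
rater_1 = [1, 1, 0, 0, 1, 0, 1, 1, 0, 1]
rater_2 = [1, 1, 0, 1, 1, 0, 1, 0, 0, 1]
kappa = cohens_kappa(rater_1, rater_2)
precision = sample_precision(rater_1)
recall = approximate_recall(0.9, 1000, 0.3, 20000)
print(round(kappa, 3), precision, round(recall, 3))
```

Because recall is measured against the baseline, correct links that the baseline query never retrieved are invisible to this estimate.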
14
Domain-specific filtering 
• Heuristic 1: co-occurrence of name of ship 
captain 
– Common practice in historical maritime 
documentation 
• Heuristic 2: date of newspaper article is within 
ship lifetime (as indicated by muster roll) 
– Average life span of ship is 30 years 
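The two heuristics can be written as simple predicates over a candidate link. This is a sketch under assumptions: the `Candidate` fields are hypothetical, the ship's lifetime is taken to be the span between its first and last muster-roll dates, and exact substring matching ignores OCR errors that a real implementation would have to tolerate.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Candidate:
    """One candidate link between a muster-roll ship and an article.
    All field names are hypothetical."""
    ship_name: str
    captain_name: str   # captain's surname from the muster roll
    first_seen: date    # earliest muster-roll date for this ship
    last_seen: date     # latest muster-roll date for this ship
    article_text: str
    article_date: date


def mentions_captain(c: Candidate) -> bool:
    """Heuristic 1: the captain's name co-occurs in the article text."""
    return c.captain_name.lower() in c.article_text.lower()


def within_lifetime(c: Candidate) -> bool:
    """Heuristic 2: the article date falls within the ship's lifetime."""
    return c.first_seen <= c.article_date <= c.last_seen


good = Candidate("Hoop", "Jansen", date(1850, 1, 1), date(1870, 12, 31),
                 "De Hoop, kapt. Jansen, van Riga", date(1862, 5, 3))
late = Candidate("Hoop", "Jansen", date(1850, 1, 1), date(1870, 12, 31),
                 "De Hoop, kapt. Jansen, van Riga", date(1905, 5, 3))
print(mentions_captain(good), within_lifetime(good), within_lifetime(late))
```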
15
Text classification 
• Task: decide whether a newspaper article is 
about a ship 
• Two techniques used 
– Naive Bayes and Support Vector Machine (SVM) 
with Sequential Minimal Optimisation (SMO) 
– WEKA implementation 
– Training set: 200 samples (121 positive, 79 
negative) 
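To make the classification task concrete, here is a small multinomial Naive Bayes classifier written from scratch in Python. It is not the WEKA implementation used in the study. The tokenizer, the add-one smoothing and the four training sentences are illustrative assumptions, and the real training set had 200 labelled samples.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts: List[str], labels: List[bool]) -> "NaiveBayes":
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, text: str) -> bool:
        n_docs = sum(self.doc_counts.values())
        best, best_score = None, -math.inf
        for c in self.classes:
            total = sum(self.word_counts[c].values())
            score = math.log(self.doc_counts[c] / n_docs)  # class prior
            for tok in tokenize(text):
                if tok in self.vocab:  # words unseen in training are skipped
                    score += math.log((self.word_counts[c][tok] + 1)
                                      / (total + len(self.vocab)))
            if score > best_score:
                best, best_score = c, score
        return best


# Toy training data: True means the article is about a ship.
train_texts = [
    "kapitein Jansen vertrokken naar Riga met het schip",
    "het schip is binnengekomen van Londen kapitein",
    "de prijs van graan is gestegen op de markt",
    "de gemeenteraad vergadert over de markt",
]
train_labels = [True, True, False, False]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("schip vertrokken naar Londen"))
print(model.predict("de prijs op de markt"))
```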
16
Configuration 
• Filter 1a: captain name 
• Filter 1b: time restriction 
• Filter 2: combine filters 1a + 1b 
• Filter 2 + text classification 
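The four configurations amount to composing boolean filters over the candidate set. The sketch below assumes each candidate link already carries the outcome of every individual check. The dictionary keys and the toy links are hypothetical.

```python
from typing import Callable, Dict, List

Link = Dict[str, object]
Filter = Callable[[Link], bool]


def combine(*filters: Filter) -> Filter:
    """Return a filter that keeps a link only if every given filter keeps it."""
    return lambda link: all(f(link) for f in filters)


def apply_filter(links: List[Link], keep: Filter) -> List[Link]:
    """Return the links accepted by the filter, preserving order."""
    return [link for link in links if keep(link)]


# Individual checks, read from precomputed flags on each link.
captain_name: Filter = lambda link: bool(link["captain_match"])
time_restriction: Filter = lambda link: bool(link["date_ok"])
text_classifier: Filter = lambda link: bool(link["about_ship"])

configurations: Dict[str, Filter] = {
    "1a": captain_name,
    "1b": time_restriction,
    "2": combine(captain_name, time_restriction),
    "2+text": combine(captain_name, time_restriction, text_classifier),
}

toy_links: List[Link] = [
    {"id": 1, "captain_match": True, "date_ok": True, "about_ship": True},
    {"id": 2, "captain_match": True, "date_ok": False, "about_ship": True},
    {"id": 3, "captain_match": False, "date_ok": True, "about_ship": False},
    {"id": 4, "captain_match": True, "date_ok": True, "about_ship": False},
]

kept = {name: [link["id"] for link in apply_filter(toy_links, f)]
        for name, f in configurations.items()}
print(kept)
```

Each added filter can only shrink the result set, which is why combining filters tends to raise precision while lowering recall.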
17
Results 
18
Analysis 
• Captain’s name turns out to be a strong 
heuristic 
• Time restriction much less useful 
• When combined, precision becomes very 
high, at the cost of (approximate) recall 
• Text classification has high precision (no false 
positives) 
• Text classification combined with heuristic 
filtering has a negative effect 
19
Discussion 
• Interestingly, the historian preferred very high 
precision at the cost of recall 
• Consequently, 16K links published as Linked 
Data (precision 0.96; approximate recall 0.13) 
• Links are to departure/arrival listings, but also 
to shipwrecks and sales 
• With good heuristics, the contribution of 
generic techniques is minimal at best 
• Absence of gold standard is realistic 
20
Limitations 
• Evaluation 
– 50 samples 
– Choice of assessors 
– Approximation of recall 
• Data 
– OCR quality of newspaper articles 
– Digitized newspaper archive covers only 10% 
21
Acknowledgements 
• Jurjen Leinenga, domain expert 
• CLARIN-NL 
http://www.clarin.nl 
• BiographyNet, Netherlands eScience Center 
http://esciencecenter.nl 
• Online appendix with details of results at 
http://dx.doi.org/10.6084/m9.figshare.1189228 
22
QUESTION TIME 
23

More Related Content

What's hot

DSD-INT 2018 Long-term streamflow forecasting for waterway transport in Centr...
DSD-INT 2018 Long-term streamflow forecasting for waterway transport in Centr...DSD-INT 2018 Long-term streamflow forecasting for waterway transport in Centr...
DSD-INT 2018 Long-term streamflow forecasting for waterway transport in Centr...Deltares
 
DSD-INT 2019 New generation models for the Dutch government in Delft3D FM Sui...
DSD-INT 2019 New generation models for the Dutch government in Delft3D FM Sui...DSD-INT 2019 New generation models for the Dutch government in Delft3D FM Sui...
DSD-INT 2019 New generation models for the Dutch government in Delft3D FM Sui...Deltares
 
DSD-INT 2017 Pre-operational probabilistic water-level forecasting with FEWS-...
DSD-INT 2017 Pre-operational probabilistic water-level forecasting with FEWS-...DSD-INT 2017 Pre-operational probabilistic water-level forecasting with FEWS-...
DSD-INT 2017 Pre-operational probabilistic water-level forecasting with FEWS-...Deltares
 
DSD-INT 2017 Water level predictions for the German North Sea coast - Stockmann
DSD-INT 2017 Water level predictions for the German North Sea coast - StockmannDSD-INT 2017 Water level predictions for the German North Sea coast - Stockmann
DSD-INT 2017 Water level predictions for the German North Sea coast - StockmannDeltares
 
DSD-INT 2018 Can we combine satellite derived Soil Moisture with hydrological...
DSD-INT 2018 Can we combine satellite derived Soil Moisture with hydrological...DSD-INT 2018 Can we combine satellite derived Soil Moisture with hydrological...
DSD-INT 2018 Can we combine satellite derived Soil Moisture with hydrological...Deltares
 
DSD-INT 2018 Latest developments in Dutch river applications using the Delft3...
DSD-INT 2018 Latest developments in Dutch river applications using the Delft3...DSD-INT 2018 Latest developments in Dutch river applications using the Delft3...
DSD-INT 2018 Latest developments in Dutch river applications using the Delft3...Deltares
 
DSD-INT 2018 Investigation of optimization options for polder flooding at the...
DSD-INT 2018 Investigation of optimization options for polder flooding at the...DSD-INT 2018 Investigation of optimization options for polder flooding at the...
DSD-INT 2018 Investigation of optimization options for polder flooding at the...Deltares
 
DSD-INT 2018 Experiences from modelling a branched lowland river in NE German...
DSD-INT 2018 Experiences from modelling a branched lowland river in NE German...DSD-INT 2018 Experiences from modelling a branched lowland river in NE German...
DSD-INT 2018 Experiences from modelling a branched lowland river in NE German...Deltares
 
DSD-INT 2018 Urban flooding and the Delft3D FM 1D2D capabilities - Washington...
DSD-INT 2018 Urban flooding and the Delft3D FM 1D2D capabilities - Washington...DSD-INT 2018 Urban flooding and the Delft3D FM 1D2D capabilities - Washington...
DSD-INT 2018 Urban flooding and the Delft3D FM 1D2D capabilities - Washington...Deltares
 
DSD-INT 2018 River Temperature Modeling, USA - Boyington
DSD-INT 2018 River Temperature Modeling, USA - BoyingtonDSD-INT 2018 River Temperature Modeling, USA - Boyington
DSD-INT 2018 River Temperature Modeling, USA - BoyingtonDeltares
 
DSD-INT 2018 Delft3D FM - validation of hydrodynamics (2D,3D) - De Goede
DSD-INT 2018 Delft3D FM - validation of hydrodynamics (2D,3D) - De GoedeDSD-INT 2018 Delft3D FM - validation of hydrodynamics (2D,3D) - De Goede
DSD-INT 2018 Delft3D FM - validation of hydrodynamics (2D,3D) - De GoedeDeltares
 
Paper reading: HashKV and beyond
Paper reading: HashKV and beyondPaper reading: HashKV and beyond
Paper reading: HashKV and beyondPingCAP
 
DSD-INT 2017 International collaboration within the Delft-FEWS system for the...
DSD-INT 2017 International collaboration within the Delft-FEWS system for the...DSD-INT 2017 International collaboration within the Delft-FEWS system for the...
DSD-INT 2017 International collaboration within the Delft-FEWS system for the...Deltares
 
DSD-INT 2018 Flood modeling in rural areas due to extreme precipitation or le...
DSD-INT 2018 Flood modeling in rural areas due to extreme precipitation or le...DSD-INT 2018 Flood modeling in rural areas due to extreme precipitation or le...
DSD-INT 2018 Flood modeling in rural areas due to extreme precipitation or le...Deltares
 
DSD-INT 2018 HydPy framework for developing and sharing hydrological models a...
DSD-INT 2018 HydPy framework for developing and sharing hydrological models a...DSD-INT 2018 HydPy framework for developing and sharing hydrological models a...
DSD-INT 2018 HydPy framework for developing and sharing hydrological models a...Deltares
 
DSD-INT 2018 Hydrodynamic and Water Quality modelization of Cuerda del Pozo r...
DSD-INT 2018 Hydrodynamic and Water Quality modelization of Cuerda del Pozo r...DSD-INT 2018 Hydrodynamic and Water Quality modelization of Cuerda del Pozo r...
DSD-INT 2018 Hydrodynamic and Water Quality modelization of Cuerda del Pozo r...Deltares
 
DSD-INT 2017 WFlow - Delft-FEWS coupling - Hegnauer
DSD-INT 2017 WFlow - Delft-FEWS coupling - HegnauerDSD-INT 2017 WFlow - Delft-FEWS coupling - Hegnauer
DSD-INT 2017 WFlow - Delft-FEWS coupling - HegnauerDeltares
 

What's hot (17)

DSD-INT 2018 Long-term streamflow forecasting for waterway transport in Centr...
DSD-INT 2018 Long-term streamflow forecasting for waterway transport in Centr...DSD-INT 2018 Long-term streamflow forecasting for waterway transport in Centr...
DSD-INT 2018 Long-term streamflow forecasting for waterway transport in Centr...
 
DSD-INT 2019 New generation models for the Dutch government in Delft3D FM Sui...
DSD-INT 2019 New generation models for the Dutch government in Delft3D FM Sui...DSD-INT 2019 New generation models for the Dutch government in Delft3D FM Sui...
DSD-INT 2019 New generation models for the Dutch government in Delft3D FM Sui...
 
DSD-INT 2017 Pre-operational probabilistic water-level forecasting with FEWS-...
DSD-INT 2017 Pre-operational probabilistic water-level forecasting with FEWS-...DSD-INT 2017 Pre-operational probabilistic water-level forecasting with FEWS-...
DSD-INT 2017 Pre-operational probabilistic water-level forecasting with FEWS-...
 
DSD-INT 2017 Water level predictions for the German North Sea coast - Stockmann
DSD-INT 2017 Water level predictions for the German North Sea coast - StockmannDSD-INT 2017 Water level predictions for the German North Sea coast - Stockmann
DSD-INT 2017 Water level predictions for the German North Sea coast - Stockmann
 
DSD-INT 2018 Can we combine satellite derived Soil Moisture with hydrological...
DSD-INT 2018 Can we combine satellite derived Soil Moisture with hydrological...DSD-INT 2018 Can we combine satellite derived Soil Moisture with hydrological...
DSD-INT 2018 Can we combine satellite derived Soil Moisture with hydrological...
 
DSD-INT 2018 Latest developments in Dutch river applications using the Delft3...
DSD-INT 2018 Latest developments in Dutch river applications using the Delft3...DSD-INT 2018 Latest developments in Dutch river applications using the Delft3...
DSD-INT 2018 Latest developments in Dutch river applications using the Delft3...
 
DSD-INT 2018 Investigation of optimization options for polder flooding at the...
DSD-INT 2018 Investigation of optimization options for polder flooding at the...DSD-INT 2018 Investigation of optimization options for polder flooding at the...
DSD-INT 2018 Investigation of optimization options for polder flooding at the...
 
DSD-INT 2018 Experiences from modelling a branched lowland river in NE German...
DSD-INT 2018 Experiences from modelling a branched lowland river in NE German...DSD-INT 2018 Experiences from modelling a branched lowland river in NE German...
DSD-INT 2018 Experiences from modelling a branched lowland river in NE German...
 
DSD-INT 2018 Urban flooding and the Delft3D FM 1D2D capabilities - Washington...
DSD-INT 2018 Urban flooding and the Delft3D FM 1D2D capabilities - Washington...DSD-INT 2018 Urban flooding and the Delft3D FM 1D2D capabilities - Washington...
DSD-INT 2018 Urban flooding and the Delft3D FM 1D2D capabilities - Washington...
 
DSD-INT 2018 River Temperature Modeling, USA - Boyington
DSD-INT 2018 River Temperature Modeling, USA - BoyingtonDSD-INT 2018 River Temperature Modeling, USA - Boyington
DSD-INT 2018 River Temperature Modeling, USA - Boyington
 
DSD-INT 2018 Delft3D FM - validation of hydrodynamics (2D,3D) - De Goede
DSD-INT 2018 Delft3D FM - validation of hydrodynamics (2D,3D) - De GoedeDSD-INT 2018 Delft3D FM - validation of hydrodynamics (2D,3D) - De Goede
DSD-INT 2018 Delft3D FM - validation of hydrodynamics (2D,3D) - De Goede
 
Paper reading: HashKV and beyond
Paper reading: HashKV and beyondPaper reading: HashKV and beyond
Paper reading: HashKV and beyond
 
DSD-INT 2017 International collaboration within the Delft-FEWS system for the...
DSD-INT 2017 International collaboration within the Delft-FEWS system for the...DSD-INT 2017 International collaboration within the Delft-FEWS system for the...
DSD-INT 2017 International collaboration within the Delft-FEWS system for the...
 
DSD-INT 2018 Flood modeling in rural areas due to extreme precipitation or le...
DSD-INT 2018 Flood modeling in rural areas due to extreme precipitation or le...DSD-INT 2018 Flood modeling in rural areas due to extreme precipitation or le...
DSD-INT 2018 Flood modeling in rural areas due to extreme precipitation or le...
 
DSD-INT 2018 HydPy framework for developing and sharing hydrological models a...
DSD-INT 2018 HydPy framework for developing and sharing hydrological models a...DSD-INT 2018 HydPy framework for developing and sharing hydrological models a...
DSD-INT 2018 HydPy framework for developing and sharing hydrological models a...
 
DSD-INT 2018 Hydrodynamic and Water Quality modelization of Cuerda del Pozo r...
DSD-INT 2018 Hydrodynamic and Water Quality modelization of Cuerda del Pozo r...DSD-INT 2018 Hydrodynamic and Water Quality modelization of Cuerda del Pozo r...
DSD-INT 2018 Hydrodynamic and Water Quality modelization of Cuerda del Pozo r...
 
DSD-INT 2017 WFlow - Delft-FEWS coupling - Hegnauer
DSD-INT 2017 WFlow - Delft-FEWS coupling - HegnauerDSD-INT 2017 WFlow - Delft-FEWS coupling - Hegnauer
DSD-INT 2017 WFlow - Delft-FEWS coupling - Hegnauer
 

Viewers also liked

Semantics for visual resources: use cases from e-culture
Semantics for visual resources: use cases from e-cultureSemantics for visual resources: use cases from e-culture
Semantics for visual resources: use cases from e-cultureGuus Schreiber
 
NoTube: integrating TV and Web with the help of semantics
NoTube: integrating TV and Web with the help of semanticsNoTube: integrating TV and Web with the help of semantics
NoTube: integrating TV and Web with the help of semanticsGuus Schreiber
 
Knowledge engineering and the Web
Knowledge engineering and the WebKnowledge engineering and the Web
Knowledge engineering and the WebGuus Schreiber
 
Semantics and the Humanities: some lessons from my journey 2000-2012
Semantics and the Humanities: some lessons from my journey 2000-2012Semantics and the Humanities: some lessons from my journey 2000-2012
Semantics and the Humanities: some lessons from my journey 2000-2012Guus Schreiber
 
Ontologies for multimedia: the Semantic Culture Web
Ontologies for multimedia: the Semantic Culture WebOntologies for multimedia: the Semantic Culture Web
Ontologies for multimedia: the Semantic Culture WebGuus Schreiber
 
Web Science: the digital heritage case
Web Science: the digital heritage caseWeb Science: the digital heritage case
Web Science: the digital heritage caseGuus Schreiber
 
The artof of knowledge engineering, or: knowledge engineering of art
The artof of knowledge engineering, or: knowledge engineering of artThe artof of knowledge engineering, or: knowledge engineering of art
The artof of knowledge engineering, or: knowledge engineering of artGuus Schreiber
 
Principles for knowledge engineering on the Web
Principles for knowledge engineering on the WebPrinciples for knowledge engineering on the Web
Principles for knowledge engineering on the WebGuus Schreiber
 
Principles and pragmatics of a Semantic Culture Web
 Principles and pragmatics of a Semantic Culture Web Principles and pragmatics of a Semantic Culture Web
Principles and pragmatics of a Semantic Culture WebGuus Schreiber
 
How the Semantic Web is transforming information access
How the Semantic Web is transforming information accessHow the Semantic Web is transforming information access
How the Semantic Web is transforming information accessGuus Schreiber
 

Viewers also liked (10)

Semantics for visual resources: use cases from e-culture
Semantics for visual resources: use cases from e-cultureSemantics for visual resources: use cases from e-culture
Semantics for visual resources: use cases from e-culture
 
NoTube: integrating TV and Web with the help of semantics
NoTube: integrating TV and Web with the help of semanticsNoTube: integrating TV and Web with the help of semantics
NoTube: integrating TV and Web with the help of semantics
 
Knowledge engineering and the Web
Knowledge engineering and the WebKnowledge engineering and the Web
Knowledge engineering and the Web
 
Semantics and the Humanities: some lessons from my journey 2000-2012
Semantics and the Humanities: some lessons from my journey 2000-2012Semantics and the Humanities: some lessons from my journey 2000-2012
Semantics and the Humanities: some lessons from my journey 2000-2012
 
Ontologies for multimedia: the Semantic Culture Web
Ontologies for multimedia: the Semantic Culture WebOntologies for multimedia: the Semantic Culture Web
Ontologies for multimedia: the Semantic Culture Web
 
Web Science: the digital heritage case
Web Science: the digital heritage caseWeb Science: the digital heritage case
Web Science: the digital heritage case
 
The artof of knowledge engineering, or: knowledge engineering of art
The artof of knowledge engineering, or: knowledge engineering of artThe artof of knowledge engineering, or: knowledge engineering of art
The artof of knowledge engineering, or: knowledge engineering of art
 
Principles for knowledge engineering on the Web
Principles for knowledge engineering on the WebPrinciples for knowledge engineering on the Web
Principles for knowledge engineering on the Web
 
Principles and pragmatics of a Semantic Culture Web
 Principles and pragmatics of a Semantic Culture Web Principles and pragmatics of a Semantic Culture Web
Principles and pragmatics of a Semantic Culture Web
 
How the Semantic Web is transforming information access
How the Semantic Web is transforming information accessHow the Semantic Web is transforming information access
How the Semantic Web is transforming information access
 

Similar to Linking historical ship records to a newspaper archive

10-31-13 “Researcher Perspectives of Data Curation” Presentation Slides
10-31-13 “Researcher Perspectives of Data Curation” Presentation Slides10-31-13 “Researcher Perspectives of Data Curation” Presentation Slides
10-31-13 “Researcher Perspectives of Data Curation” Presentation SlidesDuraSpace
 
RLUK Warwick Meeting | Iron Mountain, Jeremy Suratt
RLUK Warwick Meeting | Iron Mountain, Jeremy SurattRLUK Warwick Meeting | Iron Mountain, Jeremy Suratt
RLUK Warwick Meeting | Iron Mountain, Jeremy SurattResearchLibrariesUK
 
State of the Art: Methods and Tools for Archival Processing Metrics
State of the Art: Methods and Tools for Archival Processing MetricsState of the Art: Methods and Tools for Archival Processing Metrics
State of the Art: Methods and Tools for Archival Processing MetricsAudra Eagle Yun
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache SolrAndy Jackson
 
EarthCube's OceanLink - Project Overview and Presentation Updates (March 2014)
EarthCube's OceanLink - Project Overview and Presentation Updates (March 2014)EarthCube's OceanLink - Project Overview and Presentation Updates (March 2014)
EarthCube's OceanLink - Project Overview and Presentation Updates (March 2014)EarthCube
 
The Education of Computational Scientists
The Education of Computational ScientistsThe Education of Computational Scientists
The Education of Computational Scientistsinside-BigData.com
 
DART AARG Presentation Siena 2009
DART AARG Presentation Siena 2009DART AARG Presentation Siena 2009
DART AARG Presentation Siena 2009DART Project
 
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...Angelo Salatino
 
Freight Logistics Fundamentals
Freight Logistics FundamentalsFreight Logistics Fundamentals
Freight Logistics FundamentalsAlan Erera
 
TPDL 2015 - Profiling Web Archives
TPDL 2015 - Profiling Web ArchivesTPDL 2015 - Profiling Web Archives
TPDL 2015 - Profiling Web ArchivesSawood Alam
 
24-ad-hoc.ppt
24-ad-hoc.ppt24-ad-hoc.ppt
24-ad-hoc.pptsumadi26
 
Time Series Data in a Time Series World
Time Series Data in a Time Series WorldTime Series Data in a Time Series World
Time Series Data in a Time Series WorldMapR Technologies
 
Prospection, Prediction and Management of Archaeological Sites in Alluvial En...
Prospection, Prediction and Management of Archaeological Sites in Alluvial En...Prospection, Prediction and Management of Archaeological Sites in Alluvial En...
Prospection, Prediction and Management of Archaeological Sites in Alluvial En...Keith Challis
 
Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification
Leveraging Dynamic Query Subtopics for Time-aware Search Result DiversificationLeveraging Dynamic Query Subtopics for Time-aware Search Result Diversification
Leveraging Dynamic Query Subtopics for Time-aware Search Result DiversificationNattiya Kanhabua
 
Profiling Web Archives
Profiling Web ArchivesProfiling Web Archives
Profiling Web ArchivesSawood Alam
 
Clustering over the cultural heritage linked open dataset xlendi shipwreck
Clustering over the cultural heritage linked open dataset xlendi shipwreckClustering over the cultural heritage linked open dataset xlendi shipwreck
Clustering over the cultural heritage linked open dataset xlendi shipwreckMohamed BEN ELLEFI
 
Harvard Hypermap: An Open Source Framework for Making the World’s Geospatial ...
Harvard Hypermap: An Open Source Framework for Making the World’s Geospatial ...Harvard Hypermap: An Open Source Framework for Making the World’s Geospatial ...
Harvard Hypermap: An Open Source Framework for Making the World’s Geospatial ...Paolo Corti
 
TRB_AC60-Stormwater and Sediment-FINAL
TRB_AC60-Stormwater and Sediment-FINALTRB_AC60-Stormwater and Sediment-FINAL
TRB_AC60-Stormwater and Sediment-FINALcdmoody0
 

Similar to Linking historical ship records to a newspaper archive (20)

10-31-13 “Researcher Perspectives of Data Curation” Presentation Slides
10-31-13 “Researcher Perspectives of Data Curation” Presentation Slides10-31-13 “Researcher Perspectives of Data Curation” Presentation Slides
10-31-13 “Researcher Perspectives of Data Curation” Presentation Slides
 
RLUK Warwick Meeting | Iron Mountain, Jeremy Suratt
RLUK Warwick Meeting | Iron Mountain, Jeremy SurattRLUK Warwick Meeting | Iron Mountain, Jeremy Suratt
RLUK Warwick Meeting | Iron Mountain, Jeremy Suratt
 
Name 457 maritime economics and management ship design
Name 457 maritime economics and management  ship designName 457 maritime economics and management  ship design
Name 457 maritime economics and management ship design
 
State of the Art: Methods and Tools for Archival Processing Metrics
State of the Art: Methods and Tools for Archival Processing MetricsState of the Art: Methods and Tools for Archival Processing Metrics
State of the Art: Methods and Tools for Archival Processing Metrics
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
EarthCube's OceanLink - Project Overview and Presentation Updates (March 2014)
EarthCube's OceanLink - Project Overview and Presentation Updates (March 2014)EarthCube's OceanLink - Project Overview and Presentation Updates (March 2014)
EarthCube's OceanLink - Project Overview and Presentation Updates (March 2014)
 
The Education of Computational Scientists
The Education of Computational ScientistsThe Education of Computational Scientists
The Education of Computational Scientists
 
DART AARG Presentation Siena 2009
DART AARG Presentation Siena 2009DART AARG Presentation Siena 2009
DART AARG Presentation Siena 2009
 
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...
 
Freight Logistics Fundamentals
Freight Logistics FundamentalsFreight Logistics Fundamentals
Freight Logistics Fundamentals
 
TPDL 2015 - Profiling Web Archives
TPDL 2015 - Profiling Web ArchivesTPDL 2015 - Profiling Web Archives
TPDL 2015 - Profiling Web Archives
 
24-ad-hoc.ppt
24-ad-hoc.ppt24-ad-hoc.ppt
24-ad-hoc.ppt
 
Time Series Data in a Time Series World
Time Series Data in a Time Series WorldTime Series Data in a Time Series World
Time Series Data in a Time Series World
 
Prospection, Prediction and Management of Archaeological Sites in Alluvial En...
Prospection, Prediction and Management of Archaeological Sites in Alluvial En...Prospection, Prediction and Management of Archaeological Sites in Alluvial En...
Prospection, Prediction and Management of Archaeological Sites in Alluvial En...
 
On the two sides of the pond
On the two sides of the pondOn the two sides of the pond
On the two sides of the pond
 
Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification
Leveraging Dynamic Query Subtopics for Time-aware Search Result DiversificationLeveraging Dynamic Query Subtopics for Time-aware Search Result Diversification
Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification
 
Profiling Web Archives
Profiling Web ArchivesProfiling Web Archives
Profiling Web Archives
 
Clustering over the cultural heritage linked open dataset xlendi shipwreck
Clustering over the cultural heritage linked open dataset xlendi shipwreckClustering over the cultural heritage linked open dataset xlendi shipwreck
Clustering over the cultural heritage linked open dataset xlendi shipwreck
 
Harvard Hypermap: An Open Source Framework for Making the World’s Geospatial ...
Harvard Hypermap: An Open Source Framework for Making the World’s Geospatial ...Harvard Hypermap: An Open Source Framework for Making the World’s Geospatial ...
Harvard Hypermap: An Open Source Framework for Making the World’s Geospatial ...
 
TRB_AC60-Stormwater and Sediment-FINAL
TRB_AC60-Stormwater and Sediment-FINALTRB_AC60-Stormwater and Sediment-FINAL
TRB_AC60-Stormwater and Sediment-FINAL
 

More from Guus Schreiber

Ontologies: vehicles for reuse
Ontologies: vehicles for reuseOntologies: vehicles for reuse
Ontologies: vehicles for reuseGuus Schreiber
 
CommonKADS project management
CommonKADS project managementCommonKADS project management
CommonKADS project managementGuus Schreiber
 
UML notations used by CommonKADS
UML notations used by CommonKADSUML notations used by CommonKADS
UML notations used by CommonKADSGuus Schreiber
 
Advanced knowledge modelling
Advanced knowledge modellingAdvanced knowledge modelling
Advanced knowledge modellingGuus Schreiber
 
CommonKADS design and implementation
CommonKADS design and implementationCommonKADS design and implementation
CommonKADS design and implementationGuus Schreiber
 
CommonKADS communication model
CommonKADS communication modelCommonKADS communication model
CommonKADS communication modelGuus Schreiber
 
CommonKADS knowledge modelling process
CommonKADS knowledge modelling processCommonKADS knowledge modelling process
CommonKADS knowledge modelling processGuus Schreiber
 
CommonKADS knowledge model templates
CommonKADS knowledge model templatesCommonKADS knowledge model templates
CommonKADS knowledge model templatesGuus Schreiber
 
CommonKADS knowledge modelling basics
CommonKADS knowledge modelling basicsCommonKADS knowledge modelling basics
CommonKADS knowledge modelling basicsGuus Schreiber
 
CommonKADS knowledge management
CommonKADS knowledge managementCommonKADS knowledge management
CommonKADS knowledge managementGuus Schreiber
 
CommonKADS context models
CommonKADS context modelsCommonKADS context models
CommonKADS context modelsGuus Schreiber
 
Semantic Web: From Representations to Applications
Semantic Web: From Representations to ApplicationsSemantic Web: From Representations to Applications
Semantic Web: From Representations to ApplicationsGuus Schreiber
 
The Semantic Web: status and prospects
The Semantic Web: status and prospectsThe Semantic Web: status and prospects
The Semantic Web: status and prospectsGuus Schreiber
 
E-Culture semantic search pilot
E-Culture semantic search pilotE-Culture semantic search pilot
E-Culture semantic search pilotGuus Schreiber
 
Ontology Engineering: Ontology Use
Ontology Engineering: Ontology UseOntology Engineering: Ontology Use
Ontology Engineering: Ontology UseGuus Schreiber
 
Ontology engineering: Ontology alignment
Ontology engineering: Ontology alignmentOntology engineering: Ontology alignment
Ontology engineering: Ontology alignmentGuus Schreiber
 
Ontology Engineering: Ontology evaluation
Ontology Engineering: Ontology evaluationOntology Engineering: Ontology evaluation
Ontology Engineering: Ontology evaluationGuus Schreiber
 
Ontology Engineering: ontology construction II
Ontology Engineering: ontology construction IIOntology Engineering: ontology construction II
Ontology Engineering: ontology construction IIGuus Schreiber
 

Linking historical ship records to a newspaper archive

  • 1. Linking historical ship records to a newspaper archive
      Andrea Bravo Balado, Victor de Boer, Guus Schreiber
      VU University Amsterdam
  • 3. Dutch Ships and Sailors (DSS) datasets
  • 4. Results published as Linked Data
  • 6. This study
      • An increasing number of historical databases are being digitized
      • Finding matching occurrences of the same object in different datasets is both relevant (for historical research) and non-trivial
          – “Instance mapping”
      • This paper: case study of linking ship instances in two maritime datasets
  • 7. Focus on methodology
      • This study is not about developing new techniques
      • This study is about methodology:
          – What combination of existing techniques gets the “best” result?
          – What the “best” result is depends on context (i.e., goal of the historical research)
      • This is a case study, so be wary of generalization
  • 8. Data
      • Muster rolls (Northern Dutch Maritime Museum)
          – Period: 1803-1937
          – 77,043 records of 34,552 seamen
          – 17,098 mentions of 4,935 ships
      • Newspaper archive (Dutch National Library)
          – Period: 1618-1995
          – 7K newspapers, 9M pages (coverage: 10%)
          – Text generated via OCR
  • 9. Timeline newspapers in the archive
  • 10. Example muster roll record (in Dutch)
  • 11. Example newspaper article (in Dutch)
  • 12. Approach
      • Generate candidate set of links
      • Apply two types of filters to the candidate set
          – Domain-specific filtering
              • Using domain heuristics about ship identification
          – Text classification of newspaper articles
              • Determine whether the article is about a ship
      • Combine filters
  • 13. Baseline generation
      • Find all ship instances in the muster rolls
      • Query newspaper archive for first 100 hits with this name
          – API: http://www.delpher.nl/
      • Result set is expected to have high recall but low precision
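The baseline step can be sketched roughly as follows. This is an illustration only, not the authors' code: `search_archive` is a hypothetical stand-in for the newspaper archive's search interface (the real Delpher API is not reproduced here), and the in-memory `ARCHIVE` with its three articles is invented so that the example runs on its own.

```python
MAX_HITS = 100  # the slide's "first 100 hits" per ship name

# Toy stand-in for the newspaper archive: article id -> OCR text (invented data).
ARCHIVE = {
    "art-1": "Het schip Maria, kapitein Jansen, is te Delfzijl aangekomen.",
    "art-2": "Maria de Vries is gisteren in het huwelijk getreden.",
    "art-3": "De Hoop, kapitein Smit, vertrokken naar Riga.",
}


def search_archive(name, limit):
    """Hypothetical search: ids of articles whose text mentions `name`."""
    hits = [aid for aid, text in ARCHIVE.items() if name.lower() in text.lower()]
    return hits[:limit]


def generate_candidates(ship_names, search=search_archive, limit=MAX_HITS):
    """Pair every ship name with up to `limit` articles that mention it."""
    candidates = []
    for name in ship_names:
        for article_id in search(name, limit):
            candidates.append((name, article_id))
    return candidates


# Ship names would come from the muster rolls; these two are made up.
candidates = generate_candidates(["Maria", "De Hoop"])
```

In this toy run the name "Maria" also matches an article about a person, which is the kind of false positive that makes the baseline high in recall but low in precision.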
  • 14. Evaluation
      • No gold standard
      • Manual assessment of all links is infeasible
      • Sampling method for evaluating candidates
          – 50 candidates per technique
          – 3 assessors (domain expert plus two authors)
          – Inter-observer agreement: Cohen’s kappa = 0.65
      • Recall: approximation, based on the estimated number of correct links (using the baseline)
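A minimal sketch of the quantities named on this slide, under stated assumptions. Cohen's kappa is defined for two raters; the slides do not say how the figure was aggregated over three assessors, so only the two-rater case is shown. The `approximate_recall` formula is one plausible reading of "based on the estimated number of correct links (using the baseline)", not a confirmed description of the authors' calculation. All labels below are invented.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of the raters' marginal label frequencies.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one and the same label throughout
    return (observed - expected) / (1.0 - expected)


def sample_precision(judgements):
    """Fraction of sampled candidate links judged correct."""
    return sum(judgements) / len(judgements)


def approximate_recall(n_filtered, p_filtered, n_baseline, p_baseline):
    """Estimated correct links kept by a filter, relative to the estimated
    number of correct links in the baseline (assumed formula)."""
    return (n_filtered * p_filtered) / (n_baseline * p_baseline)


# Invented judgements (1 = correct link, 0 = incorrect) for ten sampled links.
rater_a = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
rater_b = [1, 1, 0, 0, 0, 1, 0, 1, 1, 1]
kappa = cohens_kappa(rater_a, rater_b)  # 0.8 observed vs 0.52 chance agreement
```

Because precision is estimated from a sample of 50 links, both the precision and the recall figures derived this way carry sampling error.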
  • 15. Domain-specific filtering
      • Heuristic 1: co-occurrence of name of ship captain
          – Common practice in historical maritime documentation
      • Heuristic 2: date of newspaper article is within ship lifetime (as indicated by muster roll)
          – Average life span of a ship is 30 years
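The two heuristics can be written as simple predicates over a candidate link. The sketch below assumes a hypothetical `Candidate` record whose field names are invented for illustration; the slides do not describe the authors' data structures, nor how the 30-year average life span enters the filter, so that figure is not modelled here.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical candidate link between a muster-roll ship and an article."""
    ship_name: str
    captain: str        # captain's surname from the muster roll
    first_year: int     # first year the ship appears in the muster rolls
    last_year: int      # last year the ship appears in the muster rolls
    article_text: str   # OCR text of the newspaper article
    article_year: int


def mentions_captain(c: Candidate) -> bool:
    """Heuristic 1: the captain's name co-occurs in the article text."""
    return c.captain.lower() in c.article_text.lower()


def within_lifetime(c: Candidate) -> bool:
    """Heuristic 2: the article date falls inside the ship's recorded lifetime."""
    return c.first_year <= c.article_year <= c.last_year


# Invented examples: a plausible ship report and an unrelated wedding notice.
c1 = Candidate("Maria", "Jansen", 1850, 1870,
               "Het schip Maria, kapt. Jansen, aangekomen te Delfzijl.", 1862)
c2 = Candidate("Maria", "Jansen", 1850, 1870,
               "Maria de Vries is in het huwelijk getreden.", 1910)
```

Exact substring matching is fragile on OCR text; a production filter would likely need some tolerance for misrecognized characters.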
  • 16. Text classification
      • Task: decide whether a newspaper article is about a ship
      • Two techniques used
          – Naive Bayes and Support Vector Machine (SVM) with Sequential Minimal Optimisation (SMO)
          – WEKA implementation
          – Training set: 200 samples (121 positive, 79 negative)
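To illustrate the classification task, here is a small multinomial Naive Bayes classifier with add-one smoothing, written from scratch in Python. It is a stand-in for the idea only: the study used the WEKA implementations, and the six training snippets below are invented, not taken from the authors' 200-sample training set.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case word tokens."""
    return re.findall(r"\w+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocab:  # words never seen in training are skipped
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy training data.
train_texts = [
    "schip kapitein aangekomen haven",
    "schoener kapitein vertrokken lading",
    "bark gestrand schip bemanning gered",
    "huwelijk getrouwd familie kerk",
    "raad vergadering burgemeester besluit",
    "markt prijzen graan boter",
]
train_labels = ["ship", "ship", "ship", "other", "other", "other"]
model = NaiveBayes().fit(train_texts, train_labels)
```

A call such as `model.predict("het schip met kapitein is vertrokken")` then returns the label whose smoothed word likelihoods best explain the article.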
  • 17. Configuration
      • Filter 1a: captain name
      • Filter 1b: time restriction
      • Filter 2: combine filters 1a + 1b
      • Filter 2 + text classification
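The four configurations can be read as compositions of predicates. The sketch below assumes that "combine" means logical AND (a link is kept only if every filter accepts it), which the slides do not state explicitly. The candidates are invented and carry precomputed boolean flags so that the composition itself stays in focus.

```python
def combine(*filters):
    """Return a filter that keeps a candidate only if every given filter keeps it."""
    def combined(candidate):
        return all(f(candidate) for f in filters)
    return combined


def apply_filter(candidates, keep):
    """Ids of the candidates accepted by the predicate `keep`."""
    return [c["id"] for c in candidates if keep(c)]


# Invented candidates with precomputed outcomes of each individual check.
CANDIDATES = [
    {"id": "A", "captain": True, "lifetime": True, "about_ship": True},
    {"id": "B", "captain": True, "lifetime": False, "about_ship": True},
    {"id": "C", "captain": False, "lifetime": True, "about_ship": False},
    {"id": "D", "captain": True, "lifetime": True, "about_ship": False},
]

filter_1a = lambda c: c["captain"]       # captain name
filter_1b = lambda c: c["lifetime"]      # time restriction
classifier = lambda c: c["about_ship"]   # text classification outcome

filter_2 = combine(filter_1a, filter_1b)
filter_2_plus_classifier = combine(filter_2, classifier)

kept_1a = apply_filter(CANDIDATES, filter_1a)
kept_1b = apply_filter(CANDIDATES, filter_1b)
kept_2 = apply_filter(CANDIDATES, filter_2)
kept_all = apply_filter(CANDIDATES, filter_2_plus_classifier)
```

Under this reading each added filter can only shrink the result set, which matches the pattern of precision gained at the cost of recall.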
  • 19. Analysis
      • Captain’s name turns out to be a strong heuristic
      • Time restriction is much less useful
      • When combined, precision becomes very high, at the cost of (approximate) recall
      • Text classification has high precision (no false positives)
      • Text classification combined with heuristic filtering has a negative effect
  • 20. Discussion
      • Interestingly, the historian preferred very high precision at the cost of recall
      • Consequently, 16K links published as Linked Data (precision 0.96; approximate recall 0.13)
      • Links are to departure/arrival listings, but also to shipwrecks and sales
      • In case of good heuristics the contribution of generic techniques is at best minimal
      • Absence of gold standard is realistic
  • 21. Limitations
      • Evaluation
          – 50 samples
          – Choice of assessors
          – Approximation of recall
      • Data
          – OCR quality of newspaper articles
          – Digitized newspaper archive covers only 10%
  • 22. Acknowledgements
      • Jurjen Leinenga, domain expert
      • CLARIN-NL http://www.clarin.nl
      • BiographyNet, Netherlands eScience Center http://esciencecenter.nl
      • Online appendix with details of results at http://dx.doi.org/10.6084/m9.figshare.1189228