Linking historical ship records to a newspaper archive

Andrea Bravo Balado
Victor de Boer, Guus Schreiber
VU University Amsterdam

Context: dutchshipsandsailors.nl/

Dutch Ships and Sailors (DSS) datasets

Results published as Linked Data

This study
• An increasing number of historical databases are
being digitized
• Finding matching occurrences of the same
object in different datasets is both relevant
(for historical research) and non-trivial
– “Instance mapping”
• This paper: case study of linking ship instances
in two maritime datasets

Focus on methodology
• This study is not about developing new
techniques
• This study is about methodology:
– What combination of existing techniques gets the
“best” result?
– What the “best” result is depends on context (i.e.,
goal of the historical research)
• This is a case study, so be wary of
generalization

Data
• Muster rolls (Northern Dutch Maritime
Museum)
– Period: 1803-1937
– 77,043 records of 34,552 seamen
– 17,098 mentions of 4,935 ships
• Newspaper archive (Dutch National Library)
– Period: 1618-1995
– 7K newspapers, 9M pages (coverage: 10%)
– Text generated via OCR

Timeline of newspapers in the archive

Example muster roll record (in Dutch)

Example newspaper article (in Dutch)

Approach
• Generate candidate set of links
• Apply two types of filters to the candidate set
– Domain-specific filtering
• Using domain heuristics about ship identification
– Text classification of newspaper articles
• Determine whether the article is about a ship
• Combine filters

Baseline generation
• Find all ship instances in the muster rolls
• For each ship name, query the newspaper archive
and keep the first 100 hits
– API: http://www.delpher.nl/
• Result set is expected to have high recall but
low precision
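
The baseline step can be sketched roughly as follows. This is a minimal sketch, not the authors' code: `search_archive` is a hypothetical stand-in for a call to the newspaper archive (the real API is not described on this slide), and `fake_search` is a toy replacement used only for illustration. Only the cap of 100 hits per ship name comes from the slide.

```python
from typing import Callable, Iterable

MAX_HITS = 100  # from the slide: keep the first 100 hits per ship name


def generate_candidates(
    ship_names: Iterable[str],
    search_archive: Callable[[str], list[str]],
    max_hits: int = MAX_HITS,
) -> list[tuple[str, str]]:
    """Return (ship_name, article_id) candidate links.

    search_archive is a hypothetical callable that returns article ids
    in the order the archive ranks them for the given query string.
    """
    candidates = []
    for name in sorted(set(ship_names)):  # one query per distinct ship name
        for article_id in search_archive(name)[:max_hits]:
            candidates.append((name, article_id))
    return candidates


# Toy stand-in for the archive, for illustration only.
def fake_search(query: str) -> list[str]:
    return [f"{query.lower()}-article-{i}" for i in range(250)]


links = generate_candidates(["Anna", "Hoop", "Anna"], fake_search)
```

Matching on the ship name alone keeps nearly every true link but also many unrelated articles, which is why the candidate set is filtered afterwards.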

Evaluation
• No gold standard
• Manual assessment of all links is infeasible
• Sampling method for evaluating candidates
– 50 candidates per technique
– 3 assessors (domain expert plus two authors)
– Inter-observer agreement: Cohen’s kappa = 0.65
• Recall: approximation, based on the estimated
number of correct links (using the baseline)
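
The evaluation quantities can be illustrated with a short sketch. The slide does not give the recall formula, so `approximate_recall` below is an assumption: it divides the estimated number of correct links a filter keeps by the estimated number of correct links in the baseline. `cohens_kappa` is the standard two-rater formula; how agreement was aggregated over three assessors is not stated here. All function names and example values are hypothetical.

```python
import random


def draw_sample(candidates: list, k: int = 50, seed: int = 0) -> list:
    """Draw up to k candidates at random for manual assessment."""
    rng = random.Random(seed)
    return rng.sample(candidates, min(k, len(candidates)))


def sample_precision(judgements: list[bool]) -> float:
    """Fraction of sampled candidate links judged correct."""
    return sum(judgements) / len(judgements)


def approximate_recall(
    n_kept: int, precision_kept: float, n_baseline: int, precision_baseline: float
) -> float:
    """Assumed approximation: estimated correct links kept by a filter,
    divided by estimated correct links in the baseline."""
    return (n_kept * precision_kept) / (n_baseline * precision_baseline)


def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa for two raters labelling the same items.

    Undefined (division by zero) when expected agreement equals 1.
    """
    n = len(rater_a)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    labels = set(rater_a) | set(rater_b)
    expected = sum(
        (rater_a.count(label) / n) * (rater_b.count(label) / n) for label in labels
    )
    return (observed - expected) / (1 - expected)


# Invented judgements, for illustration only (not the study's data).
kappa = cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])
recall = approximate_recall(100, 0.5, 1000, 0.25)
```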

Domain-specific filtering
• Heuristic 1: co-occurrence of name of ship
captain
– Common practice in historical maritime
documentation
• Heuristic 2: date of newspaper article is within
ship lifetime (as indicated by muster roll)
– Average life span of ship is 30 years
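
The two heuristics can be written as simple predicates. This is a minimal sketch under stated assumptions: the `ShipRecord` and `Article` types, their field names, and the idea of taking the ship's lifetime as the span between its first and last muster roll mention are illustrative choices, not details given in the slides. The example values are invented.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ShipRecord:
    ship_name: str
    captain: str       # captain's name as recorded in the muster roll
    first_seen: date   # assumed: earliest muster roll mention of the ship
    last_seen: date    # assumed: latest muster roll mention of the ship


@dataclass
class Article:
    text: str
    published: date


def mentions_captain(ship: ShipRecord, article: Article) -> bool:
    """Heuristic 1: the captain's name co-occurs in the article text."""
    return ship.captain.lower() in article.text.lower()


def within_lifetime(ship: ShipRecord, article: Article) -> bool:
    """Heuristic 2: the article date falls within the ship's lifetime."""
    return ship.first_seen <= article.published <= ship.last_seen


# Invented example values, for illustration only.
ship = ShipRecord("Anna", "Jansen", date(1850, 1, 1), date(1870, 12, 31))
article = Article("Arrived: Anna, capt. JANSEN, from Riga.", date(1862, 5, 3))
late_article = Article("The Anna was sold at auction.", date(1901, 2, 1))
```

A plain substring match like this will miss captain names that OCR has garbled, so a real implementation would probably need some tolerance for spelling variation.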

Text classification
• Task: decide whether a newspaper article is
about a ship
• Two techniques used
– Naive Bayes and Support Vector Machine (SVM)
with Sequential Minimal Optimisation (SMO)
– WEKA implementation
– Training set: 200 samples (121 positive, 79
negative)
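
To illustrate the classification task, here is a small multinomial Naive Bayes classifier in plain Python. It is not the WEKA implementation used in the study; the class name, whitespace tokenisation, add-one smoothing, and the four training texts are all assumptions made for this sketch.

```python
import math
from collections import Counter


class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over word counts."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_class, best_score = None, -math.inf
        for c in self.classes:
            score = math.log(self.doc_counts[c] / total_docs)  # class prior
            total_words = sum(self.word_counts[c].values())
            for word in text.lower().split():
                if word not in self.vocab:
                    continue  # ignore words never seen in training
                score += math.log(
                    (self.word_counts[c][word] + 1) / (total_words + len(self.vocab))
                )
            if score > best_score:
                best_class, best_score = c, score
        return best_class


# Invented toy training data; the study used 200 labelled Dutch articles.
texts = [
    "schooner arrived from riga captain jansen",
    "brig departed for london with cargo",
    "council meeting about the new school",
    "market prices for butter and cheese",
]
labels = ["ship", "ship", "other", "other"]
model = TinyNaiveBayes().fit(texts, labels)
prediction = model.predict("the schooner departed with cargo")
```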

Configuration
• Filter 1a: captain name
• Filter 1b: time restriction
• Filter 2: combine filters 1a + 1b
• Filter 2 + text classification
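
The configurations above amount to composing predicates over candidate links. In this sketch each candidate is a dictionary with boolean flags assumed to have been computed earlier; the flag names and the `all_of` helper are hypothetical.

```python
from typing import Callable

Candidate = dict  # hypothetical: one candidate link with precomputed flags
Filter = Callable[[Candidate], bool]


def all_of(*filters: Filter) -> Filter:
    """Keep a candidate only if every given filter accepts it."""
    return lambda candidate: all(f(candidate) for f in filters)


# Hypothetical flags, assumed to be computed upstream for each candidate.
filter_1a: Filter = lambda c: c["captain_mentioned"]
filter_1b: Filter = lambda c: c["date_in_lifetime"]
filter_2 = all_of(filter_1a, filter_1b)
filter_2_plus_text = all_of(filter_2, lambda c: c["classified_as_ship"])

# Invented candidates, for illustration only.
candidates = [
    {"id": 1, "captain_mentioned": True, "date_in_lifetime": True, "classified_as_ship": False},
    {"id": 2, "captain_mentioned": True, "date_in_lifetime": False, "classified_as_ship": True},
    {"id": 3, "captain_mentioned": True, "date_in_lifetime": True, "classified_as_ship": True},
]
kept_2 = [c["id"] for c in candidates if filter_2(c)]
kept_2_text = [c["id"] for c in candidates if filter_2_plus_text(c)]
```

Each added conjunct can only shrink the kept set, which matches the trade-off of higher precision for lower recall.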

Analysis
• Captain’s name turns out to be a strong
heuristic
• Time restriction much less useful
• When combined, precision becomes very
high, at the cost of (approximate) recall
• Text classification has high precision (no false
positives)
• Text classification combined with heuristic
filtering has a negative effect

Discussion
• Interestingly, the historian preferred very high
precision at the cost of recall
• Consequently, 16K links published as Linked
Data (precision 0.96; approximate recall 0.13)
• Links are to departure/arrival listings, but also
to shipwrecks and sales
• When good heuristics are available, the contribution
of generic techniques is at best minimal
• Absence of gold standard is realistic

Limitations
• Evaluation
– 50 samples
– Choice of assessors
– Approximation of recall
• Data
– OCR quality of newspaper articles
– Digitized newspaper archive covers only 10%

Acknowledgements
• Jurjen Leinenga, domain expert
• CLARIN-NL
http://www.clarin.nl
• BiographyNet, Netherlands eScience Center
http://esciencecenter.nl
• Online appendix with details of results at
http://dx.doi.org/10.6084/m9.figshare.1189228