6. This study
• An increasing number of historical databases
are being digitized
• Finding matching occurrences of the same
object in different datasets is both relevant
(for historical research) and non-trivial
– “Instance mapping”
• This paper: case study of linking ship instances
in two maritime datasets
7. Focus on methodology
• This study is not about developing new
techniques
• This study is about methodology:
– What combination of existing techniques gets the
“best” result?
– What the “best” result is depends on context (i.e.,
goal of the historical research)
• This is a case study, so be wary of
generalization
8. Data
• Muster rolls (Northern Dutch Maritime
Museum)
– Period: 1803-1937
– 77,043 records of 34,552 seamen
– 17,098 mentions of 4,935 ships
• Newspaper archive (Dutch National Library)
– Period: 1618-1995
– 7K newspapers, 9M pages (coverage: 10%)
– Text generated via OCR
12. Approach
• Generate candidate set of links
• Apply two types of filters to the candidate set
– Domain-specific filtering
• Using domain heuristics about ship identification
– Text classification of newspaper articles
• Determine whether the article is about a ship
• Combine filters
13. Baseline generation
• Find all ship instances in the muster rolls
• Query the newspaper archive for the first 100
hits for each ship name
– API: http://www.delpher.nl/
• Result set is expected to have high recall but
low precision
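The baseline step can be sketched roughly as follows. This is an illustration under stated assumptions, not the code used in the study: `search_archive` is a hypothetical stand-in for a query against the newspaper archive (the actual Delpher interface is not shown here), and the toy archive contains made-up text.

```python
from typing import Callable, Iterable

MAX_HITS = 100  # the slides state that the first 100 hits per name are kept


def generate_candidates(
    ship_names: Iterable[str],
    search_archive: Callable[[str, int], list[str]],
    max_hits: int = MAX_HITS,
) -> list[tuple[str, str]]:
    """Return (ship_name, article_id) candidate links.

    `search_archive` is a hypothetical callable: given a name and a hit
    limit, it returns identifiers of articles whose text contains the name.
    """
    candidates = []
    for name in sorted(set(ship_names)):  # query each distinct name once
        for article_id in search_archive(name, max_hits)[:max_hits]:
            candidates.append((name, article_id))
    return candidates


# Toy in-memory archive standing in for the OCR'd newspaper text (made-up data).
TOY_ARCHIVE = {
    "a1": "De Hoop, kapt. Jansen, vertrokken naar Riga",
    "a2": "Er is weinig hoop op herstel",
    "a3": "Te koop: het schip Anna Maria",
}


def toy_search(name: str, limit: int) -> list[str]:
    # Case-insensitive substring match: finds many articles, many of them wrong.
    hits = [
        article_id
        for article_id, text in sorted(TOY_ARCHIVE.items())
        if name.lower() in text.lower()
    ]
    return hits[:limit]


links = generate_candidates(["Hoop", "Anna Maria", "Hoop"], toy_search)
```

In the toy data, article `a2` matches the ship name "Hoop" although it uses the word in its ordinary sense ("hope"), which is the kind of false positive that makes the baseline low in precision.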
14. Evaluation
• No gold standard
• Manual assessment of all links is infeasible
• Sampling method for evaluating candidates
– 50 candidates per technique
– 3 assessors (domain expert plus two authors)
– Inter-observer agreement: Cohen’s kappa = 0.65
• Recall: approximation, based on the estimated
number of correct links (using the baseline)
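The two evaluation quantities can be illustrated with a short sketch. Cohen's kappa is defined for a pair of raters; how the study aggregated it over three assessors is not stated on the slide, so only the pairwise form is shown. The recall formula is one plausible reading of "estimated number of correct links (using the baseline)" and is an assumption, as are all the numbers in the example.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def approximate_recall(
    n_filtered: int,
    precision_filtered: float,
    n_baseline: int,
    precision_baseline: float,
) -> float:
    """Estimated correct links kept by a filter, divided by the estimated
    number of correct links in the baseline (assumed reading of the slide)."""
    estimated_correct_baseline = n_baseline * precision_baseline
    return (n_filtered * precision_filtered) / estimated_correct_baseline


# Made-up judgements for eight sampled links ("y" = correct link).
rater_a = ["y", "y", "y", "y", "n", "n", "n", "n"]
rater_b = ["y", "y", "y", "n", "n", "n", "n", "y"]
kappa = cohens_kappa(rater_a, rater_b)

# Made-up set sizes and sampled precisions.
recall = approximate_recall(200, 0.9, 1000, 0.3)
```

Because the baseline itself misses links (it keeps only the first 100 hits per name), a recall computed this way is relative to the baseline rather than to all true links.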
15. Domain-specific filtering
• Heuristic 1: the ship captain’s name co-occurs
with the ship name in the article
– Common practice in historical maritime
documentation
• Heuristic 2: date of newspaper article is within
ship lifetime (as indicated by muster roll)
– Average life span of ship is 30 years
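As a minimal sketch, the two heuristics can be written as predicates over a candidate link. The `Candidate` record and its field names are assumptions made for illustration, and exact substring matching is a simplification: OCR errors and spelling variants in real articles would likely call for fuzzier matching.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Candidate:
    """One candidate link between a muster-roll ship and an article (assumed shape)."""
    ship_name: str
    captain: str        # captain name taken from the muster roll
    first_seen: date    # earliest muster-roll mention of the ship
    last_seen: date     # latest muster-roll mention of the ship
    article_date: date
    article_text: str


def mentions_captain(c: Candidate) -> bool:
    """Heuristic 1: the captain's name occurs in the article text."""
    return c.captain.lower() in c.article_text.lower()


def within_lifetime(c: Candidate) -> bool:
    """Heuristic 2: the article date lies within the ship's recorded lifetime."""
    return c.first_seen <= c.article_date <= c.last_seen


# Made-up examples.
good = Candidate(
    "Hoop", "Jansen", date(1840, 1, 1), date(1860, 12, 31),
    date(1850, 5, 1), "De Hoop, kapt. Jansen, vertrokken naar Riga",
)
bad = Candidate(
    "Hoop", "Jansen", date(1840, 1, 1), date(1860, 12, 31),
    date(1900, 3, 2), "Er is weinig hoop op herstel",
)
```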
16. Text classification
• Task: decide whether a newspaper article is
about a ship
• Two techniques used
– Naive Bayes and Support Vector Machine (SVM)
with Sequential Minimal Optimisation (SMO)
– WEKA implementation
– Training set: 200 samples (121 positive, 79
negative)
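To make the classification task concrete, here is a small pure-Python multinomial Naive Bayes with add-one smoothing. It is a stand-in for illustration only: the study used the WEKA implementations, whose features and settings are not given on the slide, and the training sentences below are made up.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts: list[str], labels: list[bool]) -> "NaiveBayes":
        self.class_counts = Counter(labels)
        self.word_counts = {label: Counter() for label in self.class_counts}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, text: str) -> bool:
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, doc_count in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(doc_count / total_docs)  # log prior
            for token in tokenize(text):
                if token in self.vocab:  # skip words never seen in training
                    score += math.log((counts[token] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Made-up training data: True = article is about a ship.
train_texts = [
    "schip vertrokken naar riga kapitein",
    "schip gestrand bij texel",
    "bark verkocht kapitein",
    "weinig hoop op herstel",
    "koning bezoekt de stad",
    "markt prijzen van graan",
]
train_labels = [True, True, True, False, False, False]

model = NaiveBayes().fit(train_texts, train_labels)
about_ship = model.predict("schip kapitein vertrokken")
not_about_ship = model.predict("koning bezoekt markt")
```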
17. Configuration
• Filter 1a: captain name
• Filter 1b: time restriction
• Filter 2: combine filters 1a + 1b
• Filter 2 + text classification
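The four configurations can be sketched as compositions of predicates. This assumes that "combine" means a conjunction (a link is kept only if every filter accepts it), which fits the reported rise in precision at the cost of recall but is not spelled out on the slide. The predicates simply read precomputed flags from made-up link records.

```python
from typing import Callable

Link = dict  # a candidate link as a plain dict (assumed shape)
Filter = Callable[[Link], bool]


def all_of(*filters: Filter) -> Filter:
    """Conjunction: a link passes only if every filter accepts it."""
    return lambda link: all(f(link) for f in filters)


# Stand-in predicates; each reads a precomputed boolean flag on the link.
filter_1a: Filter = lambda link: link["captain_match"]
filter_1b: Filter = lambda link: link["in_lifetime"]
classifier: Filter = lambda link: link["about_ship"]

CONFIGURATIONS = {
    "1a: captain name": filter_1a,
    "1b: time restriction": filter_1b,
    "2: 1a + 1b": all_of(filter_1a, filter_1b),
    "2 + text classification": all_of(filter_1a, filter_1b, classifier),
}

toy_links = [
    {"id": 1, "captain_match": True, "in_lifetime": True, "about_ship": True},
    {"id": 2, "captain_match": True, "in_lifetime": False, "about_ship": True},
    {"id": 3, "captain_match": False, "in_lifetime": True, "about_ship": False},
    {"id": 4, "captain_match": True, "in_lifetime": True, "about_ship": False},
]

# Which toy links survive each configuration.
kept = {
    name: [link["id"] for link in toy_links if accept(link)]
    for name, accept in CONFIGURATIONS.items()
}
```

Each added conjunct can only shrink the surviving set, so any correct link rejected by one filter is lost in every configuration that includes it.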
19. Analysis
• Captain’s name turns out to be a strong
heuristic
• Time restriction much less useful
• When combined, precision becomes very
high, at the cost of (approximate) recall
• Text classification has high precision (no false
positives)
• Text classification combined with heuristic
filtering has a negative effect
20. Discussion
• Interestingly, the historian preferred very high
precision at the cost of recall
• Consequently, 16K links were published as Linked
Data (precision 0.96; approximate recall 0.13)
• Links point to departure/arrival listings, but also
to reports of shipwrecks and sales
• When good domain heuristics are available, the
contribution of generic techniques is at best minimal
• Absence of gold standard is realistic
21. Limitations
• Evaluation
– 50 samples
– Choice of assessors
– Approximation of recall
• Data
– OCR quality of newspaper articles
– Digitized newspaper archive covers only 10%
22. Acknowledgements
• Jurjen Leinenga, domain expert
• CLARIN-NL
http://www.clarin.nl
• BiographyNet, Netherlands eScience Center
http://esciencecenter.nl
• Online appendix with details of results at
http://dx.doi.org/10.6084/m9.figshare.1189228