This talk from TDWG 2015 presents a simple model for mobilizing biodiversity data from a data-rich, diverse organisation, based on open source tools compatible with those taught in the Data Carpentry syllabus.
The talk presents an open-source toolkit (https://github.com/RBGKew/Reconciliation-and-Matching-Framework ; http://data1.kew.org/reconciliation/) to configure an Open Refine (http://openrefine.org/) compatible reconciliation service over any tabular file or structured database. "Reconciliation" is the process of converting a text string representation of a thing into a usable identifier for that thing, e.g. converting the text string "Tahina spectabilis" to "http://ipni.org/urn:lsid:ipni.org:names:77086615-1". Although the toolkit was first developed for scientific name reconciliation, it can be configured to reconcile any entity type (people, specimens, etc.). Micro-components of the tool (for data transformations: https://github.com/RBGKew/String-Transformers) are available as drop-ins for the Open Refine data cleaning package. This approach is an alternative to existing service development, which has largely been aimed at technical users. The guiding principle is to open data services to a wider range of users by lowering the barrier to entry, so that hands-on scientists and data curators, those who know their data best, can link it with external sources. Technical choices were made to fit with approaches taught in the Software Carpentry and Data Carpentry initiatives (http://datacarpentry.org/). The toolkit aids progress towards Tim Berners-Lee’s Linked Open Data principle #4 "Refer to other things using their HTTP URI-based names when publishing data on the Web" and shows how we can build the foundations of the biodiversity knowledge graph.
Tdwg 2015-nicolson-kew-mobilisation
1. A simple model for large-scale data
mobilization across a diverse
organisation
Nicky Nicolson, RBG Kew
@nickynicolson
Biodiversity Information Standards (TDWG) annual meeting
Nairobi, Kenya / 28th September – 1 October 2015
28. Connecting name data to other resources
Schinus longifolius var. paraguariensis
(Hassler) F. Barkley
Taxonomic status of 229196-2?
229196-2
Synonym
36. We’ve converted a name to an identifier
Schinus longifolius var. paraguariensis
(Hassler) F. Barkley
229196-2
Now we can use that identifier to add in
more data…
40. Connecting name data to other resources
Schinus longifolius var. paraguariensis
(Hassler) F. Barkley
Taxonomic status of 229196-2?
229196-2
Synonym
44. Thanks to:
• Biodiversity Informatics team (Abigail Barker,
Matt Blissett, James Crowe, John Iacona, Rob
Turner, Alecs Gueder)
• Plant & fungal name curation team (Christine
Barker / Irina Belyaeva / Katherine Challis /
Rafael Govaerts / Paul Kirk / Heather Lindon /
Emma Williams)
• Data improvement team (Anna Lynch, Rachel
Witherow, Malin Rivers, Esther Wainwright-Deri)
This shows the kinds of data elements that Kew has collected and how they interlink to form a “knowledge graph”
Fieldwork
…carried out in a particular geographical region…
… collects physical material…
… accessioned into multiple specialist collections (e.g. herbarium, DNA bank, seed & living collections)
Duplicate specimens are shared with other organisations
Individual researchers, teams, and organisations are represented as agents
One key activity is for researchers to label specimens with determinations
A determination is a link between a specimen and the concept that it represents
Concepts fit into classifications
The core of the concept is the name, which has a special link to a specimen via type citation
Names and classifications are published in scientific literature, accessed via bibliographic citations
Concepts can be mapped to management classifications (for reporting purposes) and to phylogenies
Finally – once we have recognised species, we can assert facts about them – e.g. their physical characteristics, traits, distributions and uses
In summary: elements about the physical specimens
… Elements which use those physical specimens to define and name species
Assertions about species
Here, we show elements which are shared with other scientific and / or academic domains: geographic localities, people/teams/organisations and scholarly literature
If we want this rich graph of data, how do we build it?
Deb’s talk : what’s an API?
A walkthrough of matching up a dataset containing names to some Kew data resources
Match the names against IPNI, get an identifier, then ask other resources what they know about that identifier (i.e. name matching is isolated in one place).
Reconciliation service configured to run against an IPNI dataset.
Can be configured to expose any tabular dataset or result-set from a relational DB.
Data first transformed – using a set of rules defined in configuration – then matched.
Transformations handle things like gender agreement: "-us" on one side and "-a" on the other transform to the same form.
Transformers can also handle authorship: "F. Barkley" and "F.A.Barkley" are probably the same.
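The transform-then-match idea in these notes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the two rules below are hypothetical and are not the rules shipped with the Kew toolkit, whose real transformations are defined in its configuration.

```python
import re

# Hypothetical rule set for illustration only (not the toolkit's configuration).
GENDER_ENDINGS = ("us", "um", "a")  # Latin masculine / neuter / feminine endings


def normalise_epithet(word: str) -> str:
    """Strip a Latin gender ending so '-us', '-um' and '-a' forms compare equal."""
    word = word.lower()
    for ending in GENDER_ENDINGS:
        # Only strip from words long enough to still be recognisable afterwards.
        if word.endswith(ending) and len(word) > len(ending) + 2:
            return word[: -len(ending)]
    return word


def normalise_name(name: str) -> str:
    """Lower-case the genus and apply the epithet rule to every later token."""
    tokens = name.split()
    if not tokens:
        return ""
    genus, rest = tokens[0].lower(), tokens[1:]
    return " ".join([genus] + [normalise_epithet(t) for t in rest])


def normalise_author(author: str) -> str:
    """Drop single-letter initials and punctuation, keeping lower-case surnames."""
    without_initials = re.sub(r"\b[^\W\d_]\.\s*", "", author)
    letters = "".join(ch for ch in without_initials if ch.isalpha() or ch.isspace())
    return " ".join(letters.lower().split())


def same_after_transform(a: str, b: str, transform) -> bool:
    """Match step: two strings agree if their transformed forms are identical."""
    return transform(a) == transform(b)


print(same_after_transform("Schinus longifolius", "Schinus longifolia", normalise_name))
print(same_after_transform("F. Barkley", "F.A.Barkley", normalise_author))
```

Both comparisons succeed because the transformation is applied to both sides before matching, so neither spelling has to be treated as the canonical one.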
A user can explore the service using a web interface
This shows the results of the query.
Use Open Refine for large volumes of data: load the data into Open Refine, identify the column of scientific names that you want to “reconcile” (send to the service), then choose “Reconcile” > “Start reconciling” on that column.
Select the IPNI service
The data are sent to the service (via JSON over HTTP), and IPNI ids (with hyperlinks to IPNI) are brought back.
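As a rough sketch of what "JSON over HTTP" means here: Open Refine reconciliation clients send a batch of queries as a JSON object keyed q0, q1, ... and receive candidate matches under the same keys. The Python below builds such a request body and reads a response. It makes no network call; the sample response is hand-written for illustration (it is not real service output, and the score is invented), and the function names are hypothetical.

```python
import json
from urllib.parse import urlencode


def build_reconciliation_request(names):
    """Encode a batch of names as the form body of a reconciliation request."""
    queries = {f"q{i}": {"query": name} for i, name in enumerate(names)}
    return urlencode({"queries": json.dumps(queries)})


def best_matches(response_text):
    """Map each query key to the id of its top-scoring candidate, or None."""
    response = json.loads(response_text)
    matches = {}
    for key, payload in response.items():
        candidates = payload.get("result", [])
        top = max(candidates, key=lambda c: c["score"], default=None)
        matches[key] = top["id"] if top else None
    return matches


# Hand-written sample in the shape of a reconciliation response (illustrative only).
SAMPLE_RESPONSE = json.dumps({
    "q0": {"result": [{
        "id": "229196-2",
        "name": "Schinus longifolius var. paraguariensis",
        "score": 100,   # invented score
        "match": True,
    }]},
    "q1": {"result": []},  # a query for which no candidate came back
})

body = build_reconciliation_request(
    ["Schinus longifolius var. paraguariensis", "an unmatched string"]
)
print(body[:40])
print(best_matches(SAMPLE_RESPONSE))
```

In practice Open Refine builds and sends these requests itself; the sketch only shows what travels over the wire.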
The ID can be extracted and held in its own column.
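Outside Open Refine, the same extraction step might look like the following Python sketch. It assumes the identifier arrives either as a bare id or as the tail of an IPNI URL or LSID such as the one quoted in the abstract; the function and column names are hypothetical.

```python
def extract_ipni_id(identifier: str) -> str:
    """Return the trailing local id from an IPNI URL or LSID, or the id itself."""
    tail = identifier.rstrip("/").rsplit(":", 1)[-1]  # text after the last colon
    return tail.rsplit("/", 1)[-1]                    # text after the last slash


def add_id_column(rows, source="ipni_link", target="ipni_id"):
    """Return new rows with the extracted id held in its own column."""
    return [{**row, target: extract_ipni_id(row[source])} for row in rows]


rows = [
    {"name": "Tahina spectabilis",
     "ipni_link": "http://ipni.org/urn:lsid:ipni.org:names:77086615-1"},
]
for row in add_id_column(rows):
    print(row["name"], row["ipni_id"])
```

Holding the id in its own column is what lets it be used as a join key against other resources in the next step.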
Choose “Add columns from TPL…”
TPL gives us a list of “properties” it knows names have. I’ve chosen “taxonomic status”, and there’s a preview on the right.
A minute later, and an extra column is added with the status from TPL.
Fourth from the top is a synonym, but this real dataset shouldn’t have had any synonyms in it.
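The check described in these last steps (add a status column by identifier, then look for values that should not be there) can be sketched in Python as below. The lookup table stands in for the second resource's answers. Only the pairing of 229196-2 with "Synonym" comes from the slides; the other row, its id, and its status are invented placeholders, and the function names are hypothetical.

```python
# Rows as they might look after reconciliation; the second row is a placeholder.
rows = [
    {"name": "Schinus longifolius var. paraguariensis", "ipni_id": "229196-2"},
    {"name": "Hypothetical name", "ipni_id": "hypothetical-1"},
]

# Stand-in for the second resource, keyed by identifier.
# "229196-2" -> "Synonym" is from the slides; the other entry is invented.
status_by_id = {"229196-2": "Synonym", "hypothetical-1": "Accepted"}


def add_status(rows, status_by_id):
    """Return new rows with a 'taxonomic_status' column looked up by identifier."""
    return [
        {**row, "taxonomic_status": status_by_id.get(row["ipni_id"])}
        for row in rows
    ]


def flag_status(rows, unexpected="Synonym"):
    """Return the rows whose status is one the dataset should not contain."""
    return [row for row in rows if row["taxonomic_status"] == unexpected]


with_status = add_status(rows, status_by_id)
flagged = flag_status(with_status)
for row in flagged:
    print(row["ipni_id"], row["name"], row["taxonomic_status"])
```

The join works only because both resources share the same identifier, which is the point of reconciling names to ids first.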
Users of the service include research & development staff at many institutes – largely without support from Kew (using Open Refine user support material)
Linking the data like this enables us to do different kinds of research