For research institutes, data libraries, and data
archives, RDF data validation according to predefined constraints
is a much sought-after feature, particularly as this is taken
for granted in the XML world. Based on our work in the
DCMI RDF Application Profiles Task Group and in cooperation
with the W3C Data Shapes Working Group, we have identified and
published to date 81 types of constraints that are required
by various stakeholders for data applications. In this paper,
in collaboration with several domain experts, we formulate 115
constraints on three different vocabularies (DDI-RDF, QB, and
SKOS) and classify them according to (1) the severity of an
occurring violation and (2) the complexity of the constraint
expression in common constraint languages. We evaluate the
data quality of 15,694 data sets (4.26 billion triples) of research
data for the social, behavioral, and economic sciences obtained
from 33 SPARQL endpoints. Based on the results, we formulate
several findings to direct the further development of constraint
languages.
2016.02 - Validating RDF Data Quality using Constraints to Direct the Development of Constraint Languages (ICSC 2016)
1. Validating RDF Data Quality
using Constraints
to Direct the Development of Constraint Languages
Thomas Hartmann
Benjamin Zapilko, Joachim Wackerow, Kai Eckert
International Conference on Semantic Systems (ICSC 2016)
3. <!ELEMENT library (book+, author*)>
<!ELEMENT book (isbn, title, author-ref+)>
<!ATTLIST book
id ID #REQUIRED
>
<!ELEMENT author-ref EMPTY>
<!ATTLIST author-ref
id IDREF #REQUIRED
>
<!ELEMENT author (name)>
<!ATTLIST author
id ID #REQUIRED
>
<!ELEMENT isbn (#PCDATA)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT name (#PCDATA)>
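The DTD on this slide declares `id` as an ID on `book` and `author` and as an IDREF on `author-ref`. As a rough illustration of what those two declarations ask a validator to enforce, the following Python sketch checks them by hand with the standard library. It is not a DTD validator: it ignores the content models (`book+`, `author*`, and so on), and the sample document and the function name `check_id_constraints` are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document shaped like the DTD on the slide.
SAMPLE = """
<library>
  <book id="b1">
    <isbn>0000000000</isbn>
    <title>Example Title</title>
    <author-ref id="a1"/>
  </book>
  <author id="a1"><name>Example Author</name></author>
</library>
"""


def check_id_constraints(xml_text):
    """Return a list of violation messages for the ID/IDREF rules only."""
    root = ET.fromstring(xml_text.strip())
    violations = []
    seen = set()

    # ID: every book and author needs an id, unique across the whole document.
    for elem in root.iter():
        if elem.tag in ("book", "author"):
            ident = elem.get("id")
            if ident is None:
                violations.append(f"<{elem.tag}> is missing the required id")
            elif ident in seen:
                violations.append(f"duplicate ID '{ident}'")
            else:
                seen.add(ident)

    # IDREF: every author-ref must point at an ID that exists in the document.
    for ref in root.iter("author-ref"):
        target = ref.get("id")
        if target is None:
            violations.append("<author-ref> is missing the required id")
        elif target not in seen:
            violations.append(f"author-ref '{target}' matches no ID")

    return violations


if __name__ == "__main__":
    print(check_id_constraints(SAMPLE))
```

An empty list means the sample satisfies both rules; changing the `author-ref` to `id="a2"` would produce one violation.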
22. Evaluation Setup
• 115 constraints from vocabularies and experts
• constraints classified and implemented
• on 3 vocabularies in the SBE (social, behavioral, and economic) sciences
  – well-established vocabularies (QB, SKOS)
  – vocabulary under development (DDI-RDF)
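To make "constraints classified and implemented" concrete, here is a minimal sketch of one constraint and a severity tag attached to its violations. It uses the SKOS rule that a resource should have at most one `skos:prefLabel` per language tag. The in-memory triple format, the sample data, the function name, and the severity label `"ERROR"` are all assumptions for illustration; the slides do not show how the authors implemented or labeled their constraints.

```python
from collections import defaultdict

SKOS_PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"

# Hypothetical data: (subject, predicate, (literal text, language tag)).
TRIPLES = [
    ("ex:c1", SKOS_PREF_LABEL, ("Unemployment", "en")),
    ("ex:c1", SKOS_PREF_LABEL, ("Arbeitslosigkeit", "de")),
    ("ex:c2", SKOS_PREF_LABEL, ("Income", "en")),
    ("ex:c2", SKOS_PREF_LABEL, ("Earnings", "en")),
]


def check_unique_pref_label(triples, severity="ERROR"):
    """Report resources with more than one prefLabel in the same language."""
    labels = defaultdict(list)
    for subject, predicate, obj in triples:
        if predicate == SKOS_PREF_LABEL:
            text, lang = obj
            labels[(subject, lang)].append(text)

    # One violation record per (resource, language) pair with several labels.
    return [
        {
            "severity": severity,  # assumed label, not taken from the slides
            "resource": subject,
            "language": lang,
            "labels": sorted(texts),
        }
        for (subject, lang), texts in sorted(labels.items())
        if len(texts) > 1
    ]


if __name__ == "__main__":
    for violation in check_unique_pref_label(TRIPLES):
        print(violation)
```

With the sample data, only `ex:c2` is reported, because it carries two English preferred labels.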
23. Validated Data Sets
Vocabulary   Data Sets   Triples
QB           9,990       3,775,983,610
SKOS         4,178       477,737,281
DDI-RDF      1,526       9,673,055
Total        15,694      4.26 billion
33 SPARQL Endpoints
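The totals on the "Validated Data Sets" slide can be reproduced from the per-vocabulary rows. The short check below copies the three rows from the slide; the variable names are chosen for this example only.

```python
# Per-vocabulary figures as given on the slide: (data sets, triples).
ROWS = {
    "QB": (9_990, 3_775_983_610),
    "SKOS": (4_178, 477_737_281),
    "DDI-RDF": (1_526, 9_673_055),
}

total_data_sets = sum(data_sets for data_sets, _ in ROWS.values())
total_triples = sum(triples for _, triples in ROWS.values())

print(total_data_sets)                       # 15694
print(f"{total_triples / 1e9:.2f} billion")  # 4.26 billion
```

The exact triple count is 4,263,393,946, which rounds to the 4.26 billion quoted in the abstract.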