How to clean data less through the Linked (Open Data) approach?
1. How to Clean Data Less through the Linked (Open Data) Approach
Andrea Wei-Ching Huang
Institute of Information Science, Academia Sinica, Taipei, Taiwan
Dec. 7 2015 @ IIS R101
1. Data Quality: data, metadata, linked data
2. The case of 840,000 CC-licensed data
3. How the LOD approach can improve data quality
5. Metadata Quality: Problems & Solutions (1)
Record Problems
Yasser, Chuttur M. "An analysis of problems in metadata records." Journal of Library Metadata 11.2 (2011): 51-62
6. Metadata Quality: Problems & Solutions (2)
Dublin Core Semantic Problems
Park, Jung-ran, and Eric Childress. "Dublin Core metadata semantics: An analysis of the perspectives of
information professionals." Journal of Information Science 35.6 (2009): 727-739.
• Type is a subjective value.
• Source is a confusing field. It is difficult to apply it consistently.
• Creator values can vary widely, and it can be tricky to determine exactly who the
creator is.
• The information from the publisher is vague.
• The different roles of contributors cannot be distinguished.
• There is often great ambiguity, for example:
o between Type and Relation;
o between Format and Type;
o between Creator, Publisher, and Contributor;
o between Source and Relation.
• The Relation field engenders a high degree of difficulty (55.3%):
o discerning the dynamic and interrelated nature of information objects presents
challenges in using the Relation element.
7. Metadata Quality: Problems & Solutions (3)
Current Solutions
Tani, Alice, Leonardo Candela, and Donatella Castelli. "Dealing with metadata quality: The legacy of digital library efforts."
Information Processing & Management 49.6 (2013): 1194-1205.
Tani et al. (2013): summary of metadata quality approaches.
-------------------------------------------------------------------------------------------------------------------------------------------
Metadata guidelines, standards, and application profiles
Pros: potentially effective; if shared among organizations, they promote cross-organization interoperability
Cons: hard for different organizations to agree on; they often end up as complex combinations of features
reflecting the interests of many disparate parties; they infringe on the autonomy of the entities adopting them
Metadata evaluation approaches (analytic-oriented and crowdsourcing-oriented)
Pros: helpful for identifying specific problems
Cons: based on community-specific criteria
Semi-automatic metadata generation approaches
Pros: helpful for dealing with the data deluge
Cons: still require human assessment
Metadata cleaning, enhancement, and augmentation approaches
Pros: fundamental for enabling cross-community exploitation of metadata
Cons: information loss; information inconsistency
-------------------------------------------------------------------------------------------------------------------------------------------
8. 2. The case of 840,000 CC-licensed data
in the Union Catalogue of Digital Archives Taiwan
9. “Fitness for Use” is the Key:
Data Quality (DQ) Definition for Digital Data
Nicholas R. Chrisman (1986):
"Digital data can adapt to a broader range of uses with a broader range of special demand,
… The root of data abuse is not in the quality of the data, but in the awareness and
understanding of the quality of the data. By converting to the fitness for use approach,
the problem of data abuse is moved from producer to consumer (data user)."
W3C Data Quality Vocabulary (2015):
"...quality lies in the eye of the beholder; that there is no objective, ideal definition of it.
Some datasets will be judged as low-quality resources by some data consumers, while
they will perfectly fit others' needs."
Quality from the perspectives of both supply and demand sides,
e.g. Data Publishers, Certification Agencies, Data Aggregators, and Data Consumers.
Quality is pragmatic, user-specific, and context-dependent.
10. "Fitness for Use" in different contexts:
[Diagram] Data lifecycle: physical object → digital object → digital collection → digital aggregation & publication → reusing & semantic representation
(steps: Creation, Conversion 1, Conversion 2, Conversion 3, Conversion 4, Clean & Enrich, Conversion 5)
Formats along the pipeline: TEXT/Image, XML/HTML, XLSX/Table/HTML, CSV, CSV, RDF/Turtle
CONTEXT I – Local Curation (90 projects): locally developed schemes
CONTEXT II – Digital Archive Curation (1 portal): DC 15 elements as the requirement for the Union Catalog
CONTEXT III – Linked Open Data (globally linked & semantically represented): globally linked, machine-accessible semantics and domain-knowledge vocabularies are needed for LOD
11. [Diagram] The same pipeline (physical object → digital object → digital collection → digital aggregation & publication → reusing & semantic representation; TEXT/Image, XML/HTML, XLSX/Table/HTML, CSV, CSV, RDF/Turtle), annotated with where data quality work happens:
Local Curation (90 projects) – Metadata Generation: metadata guidelines & standard (DC 15) are provided; Data Quality
Digital Archive Curation (1 portal): metadata evaluation approaches; semi-automatic metadata generation approaches; Data Quality
Linked Open Data (globally linked & semantically represented) – Linked Data Generation: metadata cleaning, enhancement, and augmentation approaches; Data Quality
Open questions at each stage: Information loss? Interpretation problems? Time & resource cost?
12. Problems identified in the case of the 840,000 CC-licensed data
1. Confusion of Dublin Core (DC) Definitions
2. Name Ambiguity
3. Inconsistent Encoding
4. Semantic Overlaps
5. Duplicate Records
6. Insufficient Element Usage
7. Errors / Mistakes / Others
13. Considerations in the case of 840,000 CC-licensed data for LOD
1. We are not the data creators. Can we clean/revise the data "correctly"?
Keep the original CSV data open.
Publish revised/cleaned data as diff/mapping files (see the sketch below).
2. How can we prevent "information loss"?
Mapping activities often result in information loss.
Reconsider the value of broken links.
3. Limited resources and time to handle the cleaning tasks.
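A minimal sketch of the diff/mapping-file idea, assuming hypothetical file names and illustrative values (the item ID 6277845 is from the fish example later in the deck): the original open CSV is never rewritten; proposed corrections are appended to a separate diff file that records the field, the original value, the proposed value, and a note.

```python
import csv
from datetime import date

# Hypothetical file names; the original open CSV is never rewritten.
ORIGINAL_CSV = "union_catalog_items.csv"
DIFF_CSV = "union_catalog_items.diff.csv"

def record_correction(item_id, field, original_value, proposed_value, note=""):
    """Append a proposed correction to a separate diff file
    instead of editing the original CSV in place."""
    with open(DIFF_CSV, "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerow([
            item_id, field, original_value, proposed_value,
            note, date.today().isoformat(),
        ])

# Illustrative values only: place information found in the Description
# field is proposed as a Coverage value, without touching the raw data.
record_correction(
    item_id="6277845",
    field="coverage",
    original_value="",
    proposed_value="Kenting, Taiwan",
    note="recovered from description; to be mapped to GeoNames later",
)
```

Because corrections live in their own file, anyone can audit, accept, or reject them while the raw CSV keeps its original, citable form.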
14. 3. How can the Linked (Open Data) approach improve data quality?
15. 1. Raw data and new data (cleaned data, semantically refined data) can benefit from the open data approach:
Creation of new data based on combining data.
External quality checks of data (validation).
Sustainability of data (no data loss).
The ability to merge, integrate and mesh public and private data.
Janssen, Marijn, Yannis Charalabidis, and Anneke Zuiderwijk. "Benefits, adoption barriers and myths of open data and open government."
Information Systems Management 29.4 (2012): 258-268.
16. 2. Using SPARQL queries to identify problems:
Identify DQ problems before the RDF is generated:
Use the W3C mapping language R2RML and an RDF validation framework (RDFUnit) for the
mapping definitions, allowing publishers to catch and correct violations before they even
happen. (Dimou et al., 2015)
Identify DQ problems after the RDF is generated:
Use SPARQL and publicly shared LOD resources (e.g. DBpedia, GeoNames) as references to
identify problems (see the query sketch below). (Fürber and Hepp, 2010)
Fürber, Christian, and Martin Hepp. "Using semantic web resources for data quality management." Knowledge
Engineering and Management by the Masses. Springer Berlin Heidelberg, 2010. 211-225.
Dimou, Anastasia, et al. "Assessing and Refining Mappings to RDF to Improve Dataset Quality." The Semantic
Web - ISWC 2015. Springer International Publishing, 2015. 133-149.
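A minimal sketch of the "after RDF is generated" check, using rdflib: the Turtle data, item URIs, and field values are assumptions standing in for the published DC 15 triples, and the SPARQL query simply flags items whose place information never made it into dc:coverage, one of the problems found in this case.

```python
from rdflib import Graph

# Illustrative Turtle data standing in for the published DC 15 triples;
# item URIs and values are assumptions, not the project's real data.
sample = """
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix ex: <http://example.org/item/> .

ex:6277845 dc:title "Parapercis kentingensis" ;
           dc:description "Collected near Kenting, Taiwan." .

ex:6277846 dc:title "Parapercis kentingensis" ;
           dc:coverage "Kenting, Taiwan" .
"""

g = Graph()
g.parse(data=sample, format="turtle")

# Flag items that carry a description but no dc:coverage statement,
# i.e. place information that never made it into the Coverage element.
query = """
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?item ?desc WHERE {
    ?item dc:description ?desc .
    FILTER NOT EXISTS { ?item dc:coverage ?place }
}
"""

for item, desc in g.query(query):
    print(f"missing dc:coverage: {item} ({desc})")
```

The same pattern extends to comparisons against shared LOD resources, e.g. federating the query over DBpedia or GeoNames endpoints to cross-check values, as Fürber and Hepp describe.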
17. 3. Use vocabularies, ontologies & LOD knowledge bases:
To improve data quality at every step of a dataset's lifecycle (e.g. the W3C Data Quality Vocabulary; see the sketch below).
To enrich data semantics and increase the value of data reuse and refinement.
Collaborative representation and use of quality-relevant knowledge
Automatic identification of conflicting data requirements
Semantic definition of data
Use of Semantic Web data as trusted reference data
Content integration with ontologies
The above five points are summarized from Fürber and Hepp (2013): "Using Semantic Web
Technologies for Data Quality Management." Handbook of Data Quality. Springer Berlin
Heidelberg, 2013. 141-161.
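One way to make a quality observation itself machine-readable is the W3C Data Quality Vocabulary mentioned above. The sketch below attaches an illustrative completeness measurement to a dataset; the dataset URI, metric URI, and the 0.87 value are all invented for the example.

```python
from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import RDF, XSD

DQV = Namespace("http://www.w3.org/ns/dqv#")
EX = Namespace("http://example.org/")

g = Graph()
g.bind("dqv", DQV)

dataset = URIRef("http://example.org/dataset/union-catalog-csv")   # hypothetical dataset URI
measurement = EX["measurement/coverage-completeness-2015"]

# A hypothetical metric: the share of records with a usable dc:coverage value.
g.add((measurement, RDF.type, DQV.QualityMeasurement))
g.add((measurement, DQV.computedOn, dataset))
g.add((measurement, DQV.isMeasurementOf, EX["metric/coverageCompleteness"]))
g.add((measurement, DQV.value, Literal(0.87, datatype=XSD.double)))
g.add((dataset, DQV.hasQualityMeasurement, measurement))

print(g.serialize(format="turtle"))
```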
19. The Story of A Fish
http://catalog.digitalarchives.tw/item/00/5f/ca/d5.html
Parapercis kentingensis
20. http://URI of this Fish/6277845
[Diagram] Item 6277845 across the pipeline, 2012 → 2015 → 2016:
TEXT/Image → XLSX/Table/HTML → XML/HTML → CSV (raw data published as open data), then represented as RDF triples:
(1) 12/15 triples (statements): the metadata (DC 15) as originally recorded.
(2) 12/15 triples plus provenance: an erroneous Wikidata link is not overwritten; instead, one new "diff" triple records the correction.
(3) Mapping replaces cleaning: an erroneous time value likewise gets one new "time mapping" triple rather than an in-place fix (see the sketch after this slide).
Place information is not described in Coverage but in Description at this stage. It should be cleaned and mapped to external resources such as GeoNames and TaiwanPlaceName, by us or by others, when time and resources are available.
(4) Refined version: semantically enriched using domain vocabularies such as Darwin Core terms (prefixes: prov, r4r, schema, cc, odw).
(5) Once the raw CSV and the DC 15 triples (the DC 15 Version) are published, it is easy for others to detect the errors and to reuse and enrich the data according to their own fitness for use and interpretations. Even if there are errors from the beginning, more statements about this fish (6277845) can thus be generated according to the community's interests.
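A minimal sketch of the "mapping replaces cleaning" idea from this slide, using rdflib: the erroneous statement from the DC 15 version is left in place, one new "diff" statement carries the corrected value, and PROV records who asserted it and when. The date values and the ex:correctsProperty / ex:correctedValue helper properties are hypothetical; the project's actual r4r terms are not reproduced here.

```python
from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import DC, PROV, RDF, XSD

EX = Namespace("http://example.org/")
ITEM = URIRef("http://example.org/item/6277845")   # stand-in for the fish's real URI

g = Graph()
g.bind("dc", DC)
g.bind("prov", PROV)

# The original DC 15 statement is published as-is, wrong value and all.
g.add((ITEM, DC.date, Literal("2102-07-15")))       # illustrative mistyped date

# "Mapping replaces cleaning": one new diff triple carries the corrected value,
# and PROV records who asserted it and when, so the raw data stays untouched.
correction = EX["diff/6277845-date-1"]
g.add((correction, RDF.type, PROV.Entity))
g.add((correction, EX.correctsProperty, DC.date))    # hypothetical helper property
g.add((correction, EX.correctedValue, Literal("2012-07-15", datatype=XSD.date)))
g.add((correction, PROV.wasDerivedFrom, ITEM))
g.add((correction, PROV.wasAttributedTo, EX["agent/curation-team"]))
g.add((correction, PROV.generatedAtTime,
       Literal("2016-01-10T00:00:00", datatype=XSD.dateTime)))

print(g.serialize(format="turtle"))
```

Consumers who trust the correction can apply it at query time; consumers who do not can still see exactly what the original record said.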
21. How we clean data less through the Linked (Open Data) approach:
1. Keep the original CSV data open.
2. Clean less by mapping more: publish revised/cleaned data as diff/mapping files.
3. Publish the original DC 15 statements as 15 triples and provide provenance information.
4. Assign each item resource a URI.
5. Use domain vocabularies to enrich the resource (e.g. dwc).
6. Map and link to external databases to enrich the statements (GeoNames, TaiwanPlaceNames,
Encyclopedia of Life); see the sketch below.
7. More errors or meanings will be stated by third parties and through crowdsourcing,
according to their own interests.
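A minimal sketch of steps 4–6 above, assuming a hypothetical item namespace and placeholder external identifiers (the GeoNames and Encyclopedia of Life URIs below are not real records): the item gets its own URI, is enriched with Darwin Core terms, and is linked outward rather than rewritten.

```python
from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import DCTERMS, RDF, RDFS

DWC = Namespace("http://rs.tdwg.org/dwc/terms/")   # Darwin Core terms
EX = Namespace("http://example.org/item/")

g = Graph()
g.bind("dwc", DWC)
g.bind("dcterms", DCTERMS)

fish = EX["6277845"]   # step 4: one URI per item resource

# Step 5: enrich with a domain vocabulary (Darwin Core).
g.add((fish, RDF.type, DWC.Occurrence))
g.add((fish, DWC.scientificName, Literal("Parapercis kentingensis")))
g.add((fish, DWC.country, Literal("Taiwan")))

# Step 6: map and link to external databases rather than duplicating them.
# Both target URIs below are placeholders, not real identifiers.
g.add((fish, DCTERMS.spatial, URIRef("https://sws.geonames.org/0000000/")))   # hypothetical GeoNames place
g.add((fish, RDFS.seeAlso, URIRef("https://eol.org/pages/0000000")))          # hypothetical Encyclopedia of Life page

print(g.serialize(format="turtle"))
```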
22. REFERENCES
1. Batini, Carlo, et al. "Methodologies for data quality assessment and improvement." ACM Computing Surveys (CSUR) 41.3 (2009): 16.
2. Chrisman, Nicholas R. "Obtaining information on quality of digital data." Proc. AutoCarto London. Vol. 1. 1986.
3. Carata, Lucian, et al. "A primer on provenance." Communications of the ACM 57.5 (2014): 52-60.
4. Dimou, Anastasia, et al. "Assessing and Refining Mappings to RDF to Improve Dataset Quality." The Semantic Web - ISWC 2015. Springer International Publishing, 2015. 133-149.
5. Fürber, Christian, and Martin Hepp. "Using semantic web resources for data quality management." Knowledge Engineering and Management by the Masses. Springer Berlin Heidelberg, 2010. 211-225.
6. Fürber, Christian, and Martin Hepp. "Using Semantic Web Technologies for Data Quality Management." Handbook of Data Quality. Springer Berlin Heidelberg, 2013. 141-161.
7. van Hooland, Seth, and Ruben Verborgh. Linked Data for Libraries, Archives and Museums. 2014.
8. Janssen, Marijn, Yannis Charalabidis, and Anneke Zuiderwijk. "Benefits, adoption barriers and myths of open data and open government." Information Systems Management 29.4 (2012): 258-268.
9. Manus, Susan. "The Value of a Broken Link." 2012. http://blogs.loc.gov/digitalpreservation/2012/03/the-value-of-a-broken-link/
10. Park, Jung-ran, and Eric Childress. "Dublin Core metadata semantics: An analysis of the perspectives of information professionals." Journal of Information Science 35.6 (2009): 727-739.
11. Stvilia, Besiki, et al. "A framework for information quality assessment." Journal of the American Society for Information Science and Technology 58.12 (2007): 1720-1733.
12. Tani, Alice, Leonardo Candela, and Donatella Castelli. "Dealing with metadata quality: The legacy of digital library efforts." Information Processing & Management 49.6 (2013): 1194-1205.
13. W3C. Data Quality Vocabulary. 2015. http://www.w3.org/TR/vocab-dqv/
14. Yasser, Chuttur M. "An analysis of problems in metadata records." Journal of Library Metadata 11.2 (2011): 51-62.
15. Zaveri, Amrapali, et al. "Quality assessment for linked open data: A survey." Semantic Web 7.1 (2016).
23. Merry Christmas
Happy New Year
We will release the DC 15 Versions and the Refined Version (Biology) shortly.