How to Clean data Less through
Linked (Open Data) Approach
Andrea Wei-Ching Huang
Institute of Information Science, Academia Sinica, Taipei, Taiwan
Dec. 7 2015 @ IIS R101
1. Data Quality: data, metadata, linked data
2. The case of 840,000 CC-licensed data
3. How can the LOD approach improve data quality?
1. Data Quality:
data, metadata, linked data
Quality dimensions compared across five frameworks (columns, left to right):
Information Quality: Stvilia et al. (2007), 22 dimensions
Data Quality: Batini et al. (2009), 28 dimensions
Metadata Quality: Tani et al. (2013), 10 parameters
Linked Data Quality: Zaveri et al. (2016), 18 dimensions
Data Quality Vocabulary: W3C (2015), 10 dimensions
Naturalness (I) Interoperability (RP) Statistics
Accessibility (R) Accessibility Accessibility Availability (A) Availability
Accuracy (R) Accuracy Accuracy (S) Semantic Accuracy (I) Accuracy
Accuracy/Validity (I) Applicability Pertinence Syntactic Validity (I)
Appropriate amount of data
Complexity (R) Clarity
Precision/Completeness(R) Completeness Completeness(S) Completeness (I) Completeness
Informativeness/Redundancy(R) Comprehensiveness Understandability (C)
Informativeness/Redundancy(I) Conciseness Conciseness (I)
Structural Consistency (I) Consistency Similarity Consistency (I) Consistency
Convenience
Structural Consistency(R) Correctness
Verifiability (R) Credibility Trustworthiness (C) Credibility
Currency (I) Currency
Semantic Consistency(I) Derivation Integrity
Ease of operation Processability
Naturalness (R) Interactivity Conformance(S) Interlinking (A) Conformance
Semantic Consistency(R) Interpretability Interpretability (RP)
Precision/Completeness(I) Maintainability Preservability
Complexity(I) Objectivity
Relevance/ Aboutness(R) Relevancy Relevance Relevancy (C) Relevance
Authority (Reputational) Reputation
Security(R) Security Security (A)
Speed Performance (A)
Timeliness Timeliness Timeliness (C) Timeliness
Traceability RP Conciseness (RP)
Cohesiveness (I) Uniqueness Significance
Usability Licensing (A)
Volatility(R) Volatility
Versatility (RP)
(I): Intrinsic; (R): Relational; (S): Metadata Spec.; (RP): Representational; (A): Accessibility; (C): Contextual
1. Accessibility/Availability
2. Accuracy
3. Completeness
4. Consistency
5. Credibility/Trustworthiness
6. Relevance
7. Timeliness
These 7 dimensions/parameters are common ground across the frameworks compared above.
Quantitative and qualitative methodologies are used together.
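As a hedged illustration of the quantitative side, completeness can be approximated as the share of expected elements that carry a non-empty value. The sketch below assumes the 15 Dublin Core elements as the expected set; the sample record and the function name `completeness` are hypothetical and not taken from the talk.

```python
# The 15 elements of the Dublin Core Metadata Element Set (DC 15).
DC15 = [
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
]


def completeness(record):
    """Return the fraction of DC 15 elements with a non-empty value."""
    filled = [e for e in DC15 if str(record.get(e, "")).strip()]
    return len(filled) / len(DC15)


# Hypothetical record: 12 of the 15 elements are filled in.
sample = {e: "x" for e in DC15}
for missing in ("coverage", "relation", "source"):
    sample[missing] = ""

print(completeness(sample))  # 0.8
```

A score like this only says how many elements are filled, not whether the values are accurate or consistent, which is why qualitative review is still needed alongside it.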
Metadata Quality: Problems & Solutions (1)
Record Problems
Yasser, Chuttur M. "An analysis of problems in metadata records." Journal of Library Metadata 11.2 (2011): 51-62
Metadata Quality: Problems & Solutions (2)
Dublin Core Semantic Problems
Park, Jung-ran, and Eric Childress. "Dublin Core metadata semantics: An analysis of the perspectives of
information professionals." Journal of Information Science 35.6 (2009): 727-739.
• Type is a subjective value.
• Source is a confusing field. It is difficult to apply it consistently.
• Creator can be very varied and it can be tricky determining exactly who the
creator is.
• The information from the publisher is vague.
• The different roles of a contributor cannot be defined.
• There is often great ambiguity in terms of Type and Relation.
• between Format and Type.
• between Creator, Publisher, and Contributor.
• between Source and Relation.
• The Relation field engenders a high degree of difficulty (55.3%):
o discerning the dynamic and interrelated nature of information objects presents challenges in using the Relation element.
Metadata Quality: Problems & Solutions (3)
Current Solutions
Tani, Alice, Leonardo Candela, and Donatella Castelli. "Dealing with metadata quality: The legacy of digital library efforts."
Information Processing & Management 49.6 (2013): 1194-1205.
Tani et al. (2013): Summary of metadata quality approaches.
-------------------------------------------------------------------------------------------------------------------------------------------
Metadata guidelines, standards and Application Profiles
• Pros: potentially effective; if shared among organizations, they promote cross-organization interoperability
• Cons: challenging for different organizations to agree on; often end up being complex combinations of features reflecting the interests of many disparate parties; they infringe the autonomy of the entities adopting them
Metadata evaluation approaches (analytic-oriented and crowdsourcing-oriented)
• Pros: helpful to identify specific problems
• Cons: based on community-specific criteria
Semi-automatic metadata generation approaches
• Pros: helpful to deal with the data deluge
• Cons: still require human assessment
Metadata cleaning, enhancement, augmentation approaches
• Pros: fundamental to enable cross-community exploitation of metadata
• Cons: information loss; information inconsistency
-------------------------------------------------------------------------------------------------------------------------------------------
2. The case of 840,000 CC-licensed data
in the Union Catalogue of Digital Archives Taiwan
“Fitness for Use” is the Key:
Data Quality (DQ) Definition for Digital Data
• Nicholas R. Chrisman (1986):
“Digital data can adapt to a broader range of
uses with a broader range of special demand,
…The root of data abuse is not in the quality
of the data, but in the awareness and
understanding of the quality of the data. By
converting to the fitness for use approach,
the problem of data abuse is moved from
producer to consumer (data user).”
• W3C Data Quality Vocabulary (2015):
“...quality lies in the eye of the beholder; that
there is no objective, ideal definition of it.
Some datasets will be judged as low-quality
resources by some data consumers, while
they will perfectly fit others' needs.”
• Quality from the perspectives of the supply
and demand sides:
e.g. Data Publishers, Certification
Agencies, Data Aggregators and Data
Consumers.
• Pragmatic
• User-specific
• Context-dependent
[Figure: the data lifecycle across three contexts]
Stages: physical object → digital object → digital collection → digital aggregation & publication → reusing & semantic representation
Steps: Creation → Conversion 1 → Conversion 2 → Conversion 3 → Conversion 4 → Clean & Enrich → Conversion 5
Formats along the way: XML/HTML, TEXT/Image, XLSX/Table/HTML, CSV, CSV, RDF/Turtle
CONTEXT I: Local Curation (90 projects), with locally developed schemes and the DC 15 elements as the requirement for the Union Catalog
CONTEXT II: Digital Archive Curation (1 portal)
CONTEXT III: Linked Open Data (globally linked & semantically represented)
Globally linked, machine-accessible semantics and domain knowledge vocabularies are needed for LOD.
“Fitness for Use” in different contexts:
[Figure: the same lifecycle (Local Curation with 90 projects, Digital Archive Curation with 1 portal, Linked Open Data), with a separate Data Quality judgment in each context]
Approaches shown along the lifecycle:
• Provide metadata guidelines & standard (DC 15)
• Metadata generation
• Metadata evaluation approaches
• Semi-automatic metadata generation approaches
• Metadata cleaning, enhancement, augmentation approaches
• Linked data generation
Open questions: Information loss? Interpretation problems? Time & resource cost?
Problems identified in the case of 840,000 CC data
1. Confusion of Dublin Core definitions
2. Name Ambiguity
3. Inconsistent Encoding
4. Semantic Overlaps
5. Duplicate Records
6. Insufficient Element Usage
7. Errors / Mistakes / Others
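Two of these problems, duplicate records and inconsistent encoding, lend themselves to simple automated detection. The following is a minimal sketch under stated assumptions: the rows, the field names, and the choice of `YYYY-MM-DD` as the expected date encoding are all hypothetical and only illustrate the idea.

```python
import re
from collections import Counter

# Hypothetical rows; the values are made up for illustration only.
rows = [
    {"identifier": "a1", "title": "Fish specimen", "date": "2012-03-01"},
    {"identifier": "a2", "title": "Fish specimen ", "date": "2012/03/01"},
    {"identifier": "a3", "title": "Temple photo", "date": "1 March 2012"},
]

# Assumed target encoding for dates (an assumption, not the catalogue's rule).
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def normalized_key(row):
    """Key used to spot likely duplicates: title with case and spacing folded."""
    return " ".join(row["title"].lower().split())


def find_duplicates(rows):
    """Return normalized titles that occur in more than one row."""
    counts = Counter(normalized_key(r) for r in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def find_non_iso_dates(rows):
    """Return identifiers whose date is not written as YYYY-MM-DD."""
    return [r["identifier"] for r in rows if not ISO_DATE.match(r["date"])]


print(find_duplicates(rows))     # ['fish specimen']
print(find_non_iso_dates(rows))  # ['a2', 'a3']
```

Checks like these only flag candidates. Whether two rows really describe the same object remains a judgment for someone who knows the collection.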
Considerations in the case of 840,000 CC data for LOD
1. We are not the data creators. Can we clean/revise the data “correctly”?
• Keep the original CSV data open.
• Publish revised/cleaned data as diff/mapping files.
2. How can we prevent “information loss”?
• Mapping activities often result in information loss.
• Reconsider the value of broken links.
3. Limited resources and time to handle the cleaning tasks.
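The idea of publishing revisions as a diff/mapping file, rather than overwriting the original CSV, can be sketched as follows. This is only an illustration: the row layout, the column names of the diff file, and the corrected date value are hypothetical.

```python
import csv
import io

# Hypothetical original rows, kept untouched and published as-is.
original = [
    {"id": "6277845", "field": "date", "value": "2012/03/01"},
    {"id": "6277845", "field": "title", "value": "Parapercis kentingensis"},
]

# Hypothetical corrections proposed by whoever publishes the linked data.
corrections = {("6277845", "date"): "2012-03-01"}


def build_diff(original, corrections):
    """List only the changed cells; the original rows are never modified."""
    diff = []
    for row in original:
        key = (row["id"], row["field"])
        if key in corrections and corrections[key] != row["value"]:
            diff.append({"id": row["id"], "field": row["field"],
                         "original": row["value"], "revised": corrections[key]})
    return diff


def to_csv(diff):
    """Serialize the diff as CSV text that can sit next to the raw data."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["id", "field", "original", "revised"],
        lineterminator="\n")
    writer.writeheader()
    writer.writerows(diff)
    return buffer.getvalue()


diff = build_diff(original, corrections)
print(to_csv(diff))
```

Because the diff carries both the original and the revised value, a data consumer who disagrees with a correction can ignore it and fall back to the raw data.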
3. How can the Linked (Open Data) approach
improve data quality?
1. Raw data and new data (cleaned data, semantically refined data) can benefit from the Open Data approach:
• Creation of new data based on combining data.
• External quality checks of data (validation).
• Sustainability of data (no data loss).
• The ability to merge, integrate and mesh public and private data.
Janssen, Marijn, Yannis Charalabidis, and Anneke Zuiderwijk. "Benefits, adoption barriers and myths of open data and open government."
Information Systems Management 29.4 (2012): 258-268.
2. Using SPARQL queries to identify problems:
Identify DQ problems before RDF is generated:
• Apply the W3C mapping language R2RML and the RDF validation framework RDFUnit to the mapping definitions, allowing publishers to catch and correct violations before they even occur (Dimou et al., 2015).
Identify DQ problems after RDF is generated:
• Use SPARQL and publicly shared LOD resources (e.g. DBpedia, GeoNames) as references to identify problems (Fürber and Hepp, 2010).
Fürber, Christian, and Martin Hepp. "Using semantic web resources for data quality management." Knowledge
Engineering and Management by the Masses. Springer Berlin Heidelberg, 2010. 211-225.
Dimou, Anastasia, et al. "Assessing and Refining Mappings to RDF to Improve Dataset Quality." The Semantic
Web-ISWC 2015. Springer International Publishing, 2015. 133-149.
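The "after RDF is generated" check can be illustrated with a small sketch that compares place values against trusted reference labels. It is not the method of the cited papers: the triples, the reference set standing in for a shared LOD resource, and the use of `rdfs:label` in the query text are assumptions made for this example. The SPARQL string is shown for orientation only and is not executed.

```python
DC_COVERAGE = "http://purl.org/dc/elements/1.1/coverage"

# The same check phrased as SPARQL, for orientation only (not executed here).
# It assumes the reference labels are available as rdfs:label in the queried data.
SPARQL_CHECK = """
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?item ?place WHERE {
  ?item dc:coverage ?place .
  FILTER NOT EXISTS { ?ref rdfs:label ?place }
}
"""

# Hypothetical triples (subject, predicate, object) generated from the catalogue.
triples = [
    ("http://example.org/item/1", DC_COVERAGE, "Kenting"),
    ("http://example.org/item/2", DC_COVERAGE, "Kentting"),  # misspelled on purpose
]

# Hypothetical reference labels, standing in for a shared LOD resource.
reference_places = {"Kenting", "Taipei"}


def unknown_places(triples, reference):
    """Return (item, place) pairs whose place is absent from the reference set."""
    return [(s, o) for s, p, o in triples
            if p == DC_COVERAGE and o not in reference]


print(unknown_places(triples, reference_places))
# [('http://example.org/item/2', 'Kentting')]
```

A value missing from the reference set is not necessarily wrong, since the reference itself may be incomplete. The check yields candidates for review rather than confirmed errors.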
The following five points are summarized from Fürber and Hepp (2013): "Using Semantic Web
Technologies for Data Quality Management." Handbook of Data Quality. Springer Berlin
Heidelberg, 2013. 141-161.
• Collaborative representation and use of quality-relevant knowledge
• Automatic identification of conflicting data requirements
• Semantic definition of data
• Use of Semantic Web data as trusted reference data
• Content integration with ontologies
3. Use Vocabularies, Ontologies & LOD Knowledge Bases:
• To improve data quality at every step of a dataset's lifecycle (e.g. W3C Data Quality Vocabulary).
• To enrich data semantics and increase the value of reused and refined data.
http://www.w3.org/TR/vocab-dqv/
The importance of provenance and metadata quality (Carata et al., 2014).
The Story of A Fish
http://catalog.digitalarchives.tw/item/00/5f/ca/d5.html
Parapercis kentingensis (item 6277845)
http://URI of this Fish/6277845
[Figure: timeline 2012, 2015, 2016; formats TEXT/Image, XLSX/Table/HTML, XML/HTML, then CSV (raw data published as open data)]
(1) Metadata (DC 15): 12/15 triples (statements).
(2) Provenance: 12/15 triples (statements), plus one new “diff” triple (figure labels: wikidata, err).
(3) Mapping replaces cleaning: plus one new “time mapping” triple (figure labels: time, err).
Place information is described in Description rather than in Coverage at this stage. It should be cleaned and mapped to external resources such as GeoNames and TaiwanPlaceName, by us or by others, when time and resources are available.
(4) Refined Version: semantically enriched using domain vocabularies such as Darwin Core Terms.
(5) Once the raw CSV and the DC 15 triples (DC 15 Version) are published, others can easily detect errors, and reuse and enrich the data according to their own fitness for use and interpretations. Even if there are errors from the beginning, more statements about this Fish (6277845) can thus be generated by the interests of the community.
Vocabulary prefixes shown: prov, r4r, schema, cc, odw
How we clean data less through the Linked (Open Data) Approach
1. Keep the original CSV data open.
2. Clean less and map more: publish revised/cleaned data as diff/mapping files.
3. Publish the original DC 15 statements as 15 triples and provide provenance information.
4. Assign each item resource a URI.
5. Use domain vocabularies to enrich the resource (e.g. dwc).
6. Map and link to external databases to enrich statements (GeoNames, TaiwanPlaceNames, Encyclopedia of Life).
7. More errors or meanings will be stated by third parties and through crowdsourcing, for their own interests.
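Steps 1, 3 and 4 of the list above can be sketched as a small conversion from one CSV row to N-Triples lines. This is a minimal sketch, not the project's actual pipeline: the base URI, the source file URL, the sample row, and the choice of `prov:wasDerivedFrom` for the provenance statement are assumptions made for illustration.

```python
DC = "http://purl.org/dc/elements/1.1/"
PROV = "http://www.w3.org/ns/prov#"

# Hypothetical base URI and source file URL; the real ones are not given in the slides.
BASE = "http://example.org/item/"
SOURCE = "http://example.org/raw/catalog.csv"


def escape(value):
    """Escape a string for use as an N-Triples literal."""
    return (value.replace("\\", "\\\\").replace('"', '\\"')
                 .replace("\n", "\\n").replace("\r", "\\r"))


def row_to_ntriples(item_id, row):
    """Emit one triple per non-empty DC element, plus one provenance triple."""
    subject = f"<{BASE}{item_id}>"
    lines = [f'{subject} <{DC}{element}> "{escape(value)}" .'
             for element, value in row.items() if value.strip()]
    # prov:wasDerivedFrom points back at the untouched raw CSV.
    lines.append(f"{subject} <{PROV}wasDerivedFrom> <{SOURCE}> .")
    return lines


# Hypothetical row: empty elements are skipped rather than "fixed".
row = {"title": "Parapercis kentingensis", "type": "Specimen", "coverage": ""}
for line in row_to_ntriples("6277845", row):
    print(line)
```

Skipping empty elements instead of filling them in is one way a record can end up with fewer than 15 triples (the deck reports 12/15 for its example item), while the provenance triple lets anyone trace each statement back to the raw data.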
REFERENCES
1. Batini, Carlo, et al. "Methodologies for data quality assessment and improvement." ACM Computing Surveys (CSUR) 41.3 (2009): 16.
2. Chrisman, Nicholas R. "Obtaining information on quality of digital data." Proc. AutoCarto London. Vol. 1. 1986.
3. Carata, Lucian, et al. "A primer on provenance." Communications of the ACM 57.5 (2014): 52-60.
4. Dimou, Anastasia, et al. "Assessing and Refining Mappings to RDF to Improve Dataset Quality." The Semantic Web-ISWC 2015. Springer International Publishing, 2015. 133-149.
5. Fürber, Christian, and Martin Hepp. "Using semantic web resources for data quality management." Knowledge Engineering and Management by the Masses. Springer Berlin Heidelberg, 2010. 211-225.
6. Fürber, Christian, and Martin Hepp. "Using Semantic Web Technologies for Data Quality Management." Handbook of Data Quality. Springer Berlin Heidelberg, 2013. 141-161.
7. Hooland, Seth van, and Ruben Verborgh. Linked data for libraries, archives and museums. (2014).
8. Janssen, Marijn, Yannis Charalabidis, and Anneke Zuiderwijk. "Benefits, adoption barriers and myths of open data and open government." Information Systems Management 29.4 (2012): 258-268.
9. Manus, Susan. The Value of a Broken Link (2012): http://blogs.loc.gov/digitalpreservation/2012/03/the-value-of-a-broken-link/
10. Park, Jung-ran, and Eric Childress. "Dublin Core metadata semantics: An analysis of the perspectives of information professionals." Journal of Information Science 35.6 (2009): 727-739.
11. Stvilia, Besiki, et al. "A framework for information quality assessment." Journal of the American Society for Information Science and Technology 58.12 (2007): 1720-1733.
12. Tani, Alice, Leonardo Candela, and Donatella Castelli. "Dealing with metadata quality: The legacy of digital library efforts." Information Processing & Management 49.6 (2013): 1194-1205.
13. W3C, Data Quality Vocabulary (2015), http://www.w3.org/TR/vocab-dqv/
14. Yasser, Chuttur M. "An analysis of problems in metadata records." Journal of Library Metadata 11.2 (2011): 51-62.
15. Zaveri, Amrapali, et al. "Quality assessment for linked open data: A survey." Semantic Web 7.1 (2016).
Merry Christmas
Happy New Year
We will release the DC 15 Versions and the Refined Version (Biology) shortly.

More Related Content

What's hot

Alphabet soup: CDM, VRA, CCO, METS, MODS, RDF - Why Metadata Matters
Alphabet soup: CDM, VRA, CCO, METS, MODS, RDF - Why Metadata MattersAlphabet soup: CDM, VRA, CCO, METS, MODS, RDF - Why Metadata Matters
Alphabet soup: CDM, VRA, CCO, METS, MODS, RDF - Why Metadata MattersNew York University
 
Web 3 Mark Greaves
Web 3 Mark GreavesWeb 3 Mark Greaves
Web 3 Mark GreavesMediabistro
 
Named Entity Recognition from Online News
Named Entity Recognition from Online NewsNamed Entity Recognition from Online News
Named Entity Recognition from Online NewsBernardo Najlis
 
2009 0807 Lod Gmod
2009 0807 Lod Gmod2009 0807 Lod Gmod
2009 0807 Lod GmodJun Zhao
 
SPatially Explicit Data Discovery, Extraction and Evaluation Services (SPEDDE...
SPatially Explicit Data Discovery, Extraction and Evaluation Services (SPEDDE...SPatially Explicit Data Discovery, Extraction and Evaluation Services (SPEDDE...
SPatially Explicit Data Discovery, Extraction and Evaluation Services (SPEDDE...aceas13tern
 
Scaling up Linked Data
Scaling up Linked DataScaling up Linked Data
Scaling up Linked DataEUCLID project
 
LiveLinkedData - TransWebData - Nantes 2013
LiveLinkedData - TransWebData - Nantes 2013LiveLinkedData - TransWebData - Nantes 2013
LiveLinkedData - TransWebData - Nantes 2013Luis Daniel Ibáñez
 
Introduction | Categories for Description of Works of Art | CDWA-LITE
Introduction | Categories for Description of Works of Art | CDWA-LITE Introduction | Categories for Description of Works of Art | CDWA-LITE
Introduction | Categories for Description of Works of Art | CDWA-LITE Kymberly Keeton
 
160606 data lifecycle project outline
160606 data lifecycle project outline160606 data lifecycle project outline
160606 data lifecycle project outlineIan Duncan
 
Mappings Validation
Mappings ValidationMappings Validation
Mappings Validationandimou
 
Microtask Crowdsourcing Applications for Linked Data
Microtask Crowdsourcing Applications for Linked DataMicrotask Crowdsourcing Applications for Linked Data
Microtask Crowdsourcing Applications for Linked DataEUCLID project
 
Force11 JDDCP workshop presentation, @ Force2015, Oxford
Force11 JDDCP workshop presentation, @ Force2015, OxfordForce11 JDDCP workshop presentation, @ Force2015, Oxford
Force11 JDDCP workshop presentation, @ Force2015, OxfordMark Wilkinson
 
How to describe a dataset. Interoperability issues
How to describe a dataset. Interoperability issuesHow to describe a dataset. Interoperability issues
How to describe a dataset. Interoperability issuesValeria Pesce
 
10-1-13 “Research Data Curation at UC San Diego: An Overview” Presentation Sl...
10-1-13 “Research Data Curation at UC San Diego: An Overview” Presentation Sl...10-1-13 “Research Data Curation at UC San Diego: An Overview” Presentation Sl...
10-1-13 “Research Data Curation at UC San Diego: An Overview” Presentation Sl...DuraSpace
 
Social Media World News Impact on Stock Index Values - Investment Fund Analyt...
Social Media World News Impact on Stock Index Values - Investment Fund Analyt...Social Media World News Impact on Stock Index Values - Investment Fund Analyt...
Social Media World News Impact on Stock Index Values - Investment Fund Analyt...Bernardo Najlis
 

What's hot (20)

Alphabet soup: CDM, VRA, CCO, METS, MODS, RDF - Why Metadata Matters
Alphabet soup: CDM, VRA, CCO, METS, MODS, RDF - Why Metadata MattersAlphabet soup: CDM, VRA, CCO, METS, MODS, RDF - Why Metadata Matters
Alphabet soup: CDM, VRA, CCO, METS, MODS, RDF - Why Metadata Matters
 
Web 3 Mark Greaves
Web 3 Mark GreavesWeb 3 Mark Greaves
Web 3 Mark Greaves
 
Named Entity Recognition from Online News
Named Entity Recognition from Online NewsNamed Entity Recognition from Online News
Named Entity Recognition from Online News
 
2009 0807 Lod Gmod
2009 0807 Lod Gmod2009 0807 Lod Gmod
2009 0807 Lod Gmod
 
Timbuctoo 2 EASY
Timbuctoo 2 EASYTimbuctoo 2 EASY
Timbuctoo 2 EASY
 
SPatially Explicit Data Discovery, Extraction and Evaluation Services (SPEDDE...
SPatially Explicit Data Discovery, Extraction and Evaluation Services (SPEDDE...SPatially Explicit Data Discovery, Extraction and Evaluation Services (SPEDDE...
SPatially Explicit Data Discovery, Extraction and Evaluation Services (SPEDDE...
 
Scaling up Linked Data
Scaling up Linked DataScaling up Linked Data
Scaling up Linked Data
 
LiveLinkedData - TransWebData - Nantes 2013
LiveLinkedData - TransWebData - Nantes 2013LiveLinkedData - TransWebData - Nantes 2013
LiveLinkedData - TransWebData - Nantes 2013
 
Introduction | Categories for Description of Works of Art | CDWA-LITE
Introduction | Categories for Description of Works of Art | CDWA-LITE Introduction | Categories for Description of Works of Art | CDWA-LITE
Introduction | Categories for Description of Works of Art | CDWA-LITE
 
160606 data lifecycle project outline
160606 data lifecycle project outline160606 data lifecycle project outline
160606 data lifecycle project outline
 
Mappings Validation
Mappings ValidationMappings Validation
Mappings Validation
 
Microtask Crowdsourcing Applications for Linked Data
Microtask Crowdsourcing Applications for Linked DataMicrotask Crowdsourcing Applications for Linked Data
Microtask Crowdsourcing Applications for Linked Data
 
Role of Semantic Web in Health Informatics
Role of Semantic Web in Health InformaticsRole of Semantic Web in Health Informatics
Role of Semantic Web in Health Informatics
 
Force11 JDDCP workshop presentation, @ Force2015, Oxford
Force11 JDDCP workshop presentation, @ Force2015, OxfordForce11 JDDCP workshop presentation, @ Force2015, Oxford
Force11 JDDCP workshop presentation, @ Force2015, Oxford
 
How to describe a dataset. Interoperability issues
How to describe a dataset. Interoperability issuesHow to describe a dataset. Interoperability issues
How to describe a dataset. Interoperability issues
 
NISO Forum, Denver, Sept. 24, 2012: EZID: Easy dataset identification & manag...
NISO Forum, Denver, Sept. 24, 2012: EZID: Easy dataset identification & manag...NISO Forum, Denver, Sept. 24, 2012: EZID: Easy dataset identification & manag...
NISO Forum, Denver, Sept. 24, 2012: EZID: Easy dataset identification & manag...
 
Tese phd
Tese phdTese phd
Tese phd
 
10-1-13 “Research Data Curation at UC San Diego: An Overview” Presentation Sl...
10-1-13 “Research Data Curation at UC San Diego: An Overview” Presentation Sl...10-1-13 “Research Data Curation at UC San Diego: An Overview” Presentation Sl...
10-1-13 “Research Data Curation at UC San Diego: An Overview” Presentation Sl...
 
Washington Linked Data Authority Service at University of Houston
Washington Linked Data Authority Service at University of HoustonWashington Linked Data Authority Service at University of Houston
Washington Linked Data Authority Service at University of Houston
 
Social Media World News Impact on Stock Index Values - Investment Fund Analyt...
Social Media World News Impact on Stock Index Values - Investment Fund Analyt...Social Media World News Impact on Stock Index Values - Investment Fund Analyt...
Social Media World News Impact on Stock Index Values - Investment Fund Analyt...
 

Viewers also liked

20160602 典藏目錄的語意與連結
20160602 典藏目錄的語意與連結20160602 典藏目錄的語意與連結
20160602 典藏目錄的語意與連結andrea huang
 
A preliminary study on Wikipedia Dbpdeia and Wikidata
A preliminary study on Wikipedia Dbpdeia and WikidataA preliminary study on Wikipedia Dbpdeia and Wikidata
A preliminary study on Wikipedia Dbpdeia and Wikidataandrea huang
 
Data Quality & Data Governance
Data Quality & Data GovernanceData Quality & Data Governance
Data Quality & Data GovernanceTuba Yaman Him
 
splendori si incantare
splendori si incantaresplendori si incantare
splendori si incantaresokoban
 
Inaguration of President Obama
Inaguration of President ObamaInaguration of President Obama
Inaguration of President Obamasokoban
 
三峽隆恩埔原住民族文化部落新建工程
三峽隆恩埔原住民族文化部落新建工程三峽隆恩埔原住民族文化部落新建工程
三峽隆恩埔原住民族文化部落新建工程relax.chi
 
Good shots
Good shotsGood shots
Good shotssokoban
 
Legal Aspects Of New Media 2nd Annual New Media
Legal Aspects Of New Media   2nd Annual New MediaLegal Aspects Of New Media   2nd Annual New Media
Legal Aspects Of New Media 2nd Annual New MediaPaul Jacobson
 
HAMBARUL.pps
HAMBARUL.ppsHAMBARUL.pps
HAMBARUL.ppssokoban
 
railway routes
railway routesrailway routes
railway routessokoban
 
2014-09-18 Protection of Personal Information Act readiness workshop
2014-09-18 Protection of Personal Information Act readiness workshop2014-09-18 Protection of Personal Information Act readiness workshop
2014-09-18 Protection of Personal Information Act readiness workshopPaul Jacobson
 
Work Effectively In An1
Work Effectively In An1Work Effectively In An1
Work Effectively In An1AliaSlides
 
Ochrona Dziecka W Sieci (Przed Niebezpiecznymi TreśCiami)
Ochrona Dziecka W Sieci (Przed Niebezpiecznymi TreśCiami)Ochrona Dziecka W Sieci (Przed Niebezpiecznymi TreśCiami)
Ochrona Dziecka W Sieci (Przed Niebezpiecznymi TreśCiami)guest3b97e2
 
cand turistii se amuza / when tourists have fun
cand turistii se amuza / when tourists have funcand turistii se amuza / when tourists have fun
cand turistii se amuza / when tourists have funsokoban
 
Valencia - orasul artelor si a stiintei
Valencia - orasul artelor si a stiinteiValencia - orasul artelor si a stiintei
Valencia - orasul artelor si a stiinteisokoban
 
Cel mai periculos loc turistic din lume!
Cel mai periculos loc turistic din lume!Cel mai periculos loc turistic din lume!
Cel mai periculos loc turistic din lume!sokoban
 

Viewers also liked (20)

20160602 典藏目錄的語意與連結
20160602 典藏目錄的語意與連結20160602 典藏目錄的語意與連結
20160602 典藏目錄的語意與連結
 
A preliminary study on Wikipedia Dbpdeia and Wikidata
A preliminary study on Wikipedia Dbpdeia and WikidataA preliminary study on Wikipedia Dbpdeia and Wikidata
A preliminary study on Wikipedia Dbpdeia and Wikidata
 
Data Quality & Data Governance
Data Quality & Data GovernanceData Quality & Data Governance
Data Quality & Data Governance
 
splendori si incantare
splendori si incantaresplendori si incantare
splendori si incantare
 
Inaguration of President Obama
Inaguration of President ObamaInaguration of President Obama
Inaguration of President Obama
 
Expo Milano 2015
Expo Milano 2015Expo Milano 2015
Expo Milano 2015
 
三峽隆恩埔原住民族文化部落新建工程
三峽隆恩埔原住民族文化部落新建工程三峽隆恩埔原住民族文化部落新建工程
三峽隆恩埔原住民族文化部落新建工程
 
Good shots
Good shotsGood shots
Good shots
 
Legal Aspects Of New Media 2nd Annual New Media
Legal Aspects Of New Media   2nd Annual New MediaLegal Aspects Of New Media   2nd Annual New Media
Legal Aspects Of New Media 2nd Annual New Media
 
HAMBARUL.pps
HAMBARUL.ppsHAMBARUL.pps
HAMBARUL.pps
 
railway routes
railway routesrailway routes
railway routes
 
Pavlov
PavlovPavlov
Pavlov
 
2014-09-18 Protection of Personal Information Act readiness workshop
2014-09-18 Protection of Personal Information Act readiness workshop2014-09-18 Protection of Personal Information Act readiness workshop
2014-09-18 Protection of Personal Information Act readiness workshop
 
Jisc rsc morris_2012
Jisc rsc morris_2012Jisc rsc morris_2012
Jisc rsc morris_2012
 
Biblioteche di ateneo e Iris
Biblioteche di ateneo e IrisBiblioteche di ateneo e Iris
Biblioteche di ateneo e Iris
 
Work Effectively In An1
Work Effectively In An1Work Effectively In An1
Work Effectively In An1
 
Ochrona Dziecka W Sieci (Przed Niebezpiecznymi TreśCiami)
Ochrona Dziecka W Sieci (Przed Niebezpiecznymi TreśCiami)Ochrona Dziecka W Sieci (Przed Niebezpiecznymi TreśCiami)
Ochrona Dziecka W Sieci (Przed Niebezpiecznymi TreśCiami)
 
cand turistii se amuza / when tourists have fun
cand turistii se amuza / when tourists have funcand turistii se amuza / when tourists have fun
cand turistii se amuza / when tourists have fun
 
Valencia - orasul artelor si a stiintei
Valencia - orasul artelor si a stiinteiValencia - orasul artelor si a stiintei
Valencia - orasul artelor si a stiintei
 
Cel mai periculos loc turistic din lume!
Cel mai periculos loc turistic din lume!Cel mai periculos loc turistic din lume!
Cel mai periculos loc turistic din lume!
 

Similar to How to clean data less through Linked (Open Data) approach?

Introduction to question answering for linked data & big data
Introduction to question answering for linked data & big dataIntroduction to question answering for linked data & big data
Introduction to question answering for linked data & big dataAndre Freitas
 
From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S...
From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S...From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S...
From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S...Edward Curry
 
A metadata standard for Knowledge Graphs
A metadata standard for Knowledge GraphsA metadata standard for Knowledge Graphs
A metadata standard for Knowledge GraphsMichel Dumontier
 
From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S...
From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S...From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S...
From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S...Edward Curry
 
Metadata Quality Assurance
Metadata Quality AssuranceMetadata Quality Assurance
Metadata Quality AssurancePéter Király
 
Metadata issues and challenges: Link Data
Metadata issues and challenges: Link DataMetadata issues and challenges: Link Data
Metadata issues and challenges: Link DataAmna Farzand Ali
 
Semantic Interoperability in Infocosm: Beyond Infrastructural and Data Intero...
Semantic Interoperability in Infocosm: Beyond Infrastructural and Data Intero...Semantic Interoperability in Infocosm: Beyond Infrastructural and Data Intero...
Semantic Interoperability in Infocosm: Beyond Infrastructural and Data Intero...Amit Sheth
 
Omitola birmingham cityuniv
Omitola birmingham cityunivOmitola birmingham cityuniv
Omitola birmingham cityunivTope Omitola
 
Coping with Data Variety in the Big Data Era: The Semantic Computing Approach
Coping with Data Variety in the Big Data Era: The Semantic Computing ApproachCoping with Data Variety in the Big Data Era: The Semantic Computing Approach
Coping with Data Variety in the Big Data Era: The Semantic Computing ApproachAndre Freitas
 
Big data ppt
Big data pptBig data ppt
Big data pptYash Raj
 
Data Mining mod1 ppt.pdf bca sixth semester notes
Data Mining mod1 ppt.pdf bca sixth semester notesData Mining mod1 ppt.pdf bca sixth semester notes
Data Mining mod1 ppt.pdf bca sixth semester notesasnaparveen414
 
Toward universal information access on the digital object cloud
Toward universal information access on the digital object cloudToward universal information access on the digital object cloud
Toward universal information access on the digital object cloudNational Institute of Informatics
 
Gettingstartedwithdigitalcollectionsweb[1]
Gettingstartedwithdigitalcollectionsweb[1]Gettingstartedwithdigitalcollectionsweb[1]
Gettingstartedwithdigitalcollectionsweb[1]guest410707c
 
Seminar presentation
Seminar presentationSeminar presentation
Seminar presentationKlawal13
 
Metadata quality Assurance Framework at QQML2016 - short
Metadata quality Assurance Framework at QQML2016 - shortMetadata quality Assurance Framework at QQML2016 - short
Metadata quality Assurance Framework at QQML2016 - shortPéter Király
 

Similar to How to clean data less through Linked (Open Data) approach? (20)

Sailing on the ocean of 1s and 0s
Sailing on the ocean of 1s and 0sSailing on the ocean of 1s and 0s
Sailing on the ocean of 1s and 0s
 
Introduction to question answering for linked data & big data
Introduction to question answering for linked data & big dataIntroduction to question answering for linked data & big data
Introduction to question answering for linked data & big data
 
STI Summit 2011 - Digital Worlds
STI Summit 2011 - Digital WorldsSTI Summit 2011 - Digital Worlds
STI Summit 2011 - Digital Worlds
 
From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S...
How to clean data less through Linked (Open Data) approach?

  • 1. How to Clean data Less through Linked (Open Data) Approach Andrea Wei-Ching Huang Institute of Information Science, Academia Sinica, Taipei, Taiwan Dec. 7 2015 @ IIS R101 1. Data Quality: data, metadata, linked data 2. The case of 840,000 cc licensed data 3. How LOD approach can improve data quality?
  • 2. 1. Data Quality: data, metadata, linked data
  • 3. Quality dimensions compared across five frameworks. (I): Intrinsic; (R): Relational; (S): Metadata Spec.; (RP): Representational; (A): Accessibility; (C): Contextual.

    | Information Quality, Stvilia et al. (2007): 22 dimensions | Data Quality, Batini et al. (2009): 28 dimensions | Metadata Quality, Tani et al. (2013): 10 parameters | Linked Data Quality, Zaveri et al. (2016): 18 dimensions | Data Quality Vocabulary, W3C (2015): 10 dimensions |
    |---|---|---|---|---|
    | Naturalness (I) | | | Interoperability (RP) | Statistics |
    | Accessibility (R) | Accessibility | Accessibility | Availability (A) | Availability |
    | Accuracy (R) | Accuracy | Accuracy (S) | Semantic Accuracy (I) | Accuracy |
    | Accuracy/Validity (I) | Applicability | Pertinence | Syntactic Validity (I) | |
    | | Appropriate amount of data | | | |
    | Complexity (R) | Clarity | | | |
    | Precision/Completeness (R) | Completeness | Completeness (S) | Completeness (I) | Completeness |
    | Informativeness/Redundancy (R) | Comprehensiveness | | Understandability (C) | |
    | Informativeness/Redundancy (I) | Conciseness | | Conciseness (I) | |
    | Structural Consistency (I) | Consistency | Similarity | Consistency (I) | Consistency |
    | | Convenience | | | |
    | Structural Consistency (R) | Correctness | | | |
    | Verifiability (R) | Credibility | | Trustworthiness (C) | Credibility |
    | Currency (I) | Currency | | | |
    | Semantic Consistency (I) | Derivation Integrity | | | |
    | | Ease of operation | | | Processability |
    | Naturalness (R) | Interactivity | Conformance (S) | Interlinking (A) | Conformance |
    | Semantic Consistency (R) | Interpretability | | Interpretability (RP) | |
    | Precision/Completeness (I) | Maintainability | Preservability | | |
    | Complexity (I) | Objectivity | | | |
    | Relevance/Aboutness (R) | Relevancy | Relevance | Relevancy (C) | Relevance |
    | Authority (Reputational) | Reputation | | | |
    | Security (R) | Security | | Security (A) | |
    | | Speed | | Performance (A) | |
    | | Timeliness | Timeliness | Timeliness (C) | Timeliness |
    | | Traceability | | RP Conciseness (RP) | |
    | Cohesiveness (I) | Uniqueness | Significance | | |
    | | Usability | | Licensing (A) | |
    | Volatility (R) | Volatility | | Versatility (RP) | |
  • 4. Seven dimensions/parameters are common ground: 1. Accessibility/Availability (可取得性) 2. Accuracy (正確性) 3. Completeness (完整性) 4. Consistency (一致性) 5. Credibility/Trustworthiness (可信度) 6. Relevance (相關性) 7. Timeliness (適時性). Quantitative and qualitative methodologies are both used.
  • 5. Metadata Quality: Problems & Solutions (1) Record Problems Yasser, Chuttur M. "An analysis of problems in metadata records." Journal of Library Metadata 11.2 (2011): 51-62
  • 6. Metadata Quality: Problems & Solutions (2) Dublin Core Semantic Problems. Park, Jung-ran, and Eric Childress. "Dublin Core metadata semantics: An analysis of the perspectives of information professionals." Journal of Information Science 35.6 (2009): 727-739.
      • Type is a subjective value.
      • Source is a confusing field. It is difficult to apply it consistently.
      • Creator can be very varied, and it can be tricky to determine exactly who the creator is.
      • The information from the publisher is vague.
      • The different roles of contributors cannot be defined.
      • There is often great ambiguity between Type and Relation; between Format and Type; between Creator, Publisher, and Contributor; and between Source and Relation.
      • The Relation field engenders a high degree of difficulty (55.3%): discerning the dynamic and interrelated nature of information objects presents challenges in using the Relation element.
  • 7. Metadata Quality: Problems & Solutions (3) Current Solutions. Tani, Alice, Leonardo Candela, and Donatella Castelli. "Dealing with metadata quality: The legacy of digital library efforts." Information Processing & Management 49.6 (2013): 1194-1205. Tani et al. (2013): summary of metadata quality approaches.
      • Metadata guidelines, standards and Application Profiles. Pros: potentially effective; if shared among organizations, they promote cross-organization interoperability. Cons: challenging to agree on between different organizations; often end up being complex combinations of features reflecting the interests of many disparate parties; they infringe on the autonomy of the entities adopting them.
      • Metadata evaluation approaches (analytic-oriented and crowdsourcing-oriented). Pros: helpful to identify specific problems. Cons: based on community-specific criteria.
      • Semi-automatic metadata generation approaches. Pros: helpful to deal with the data deluge. Cons: human assessment is still needed.
      • Metadata cleaning, enhancement, augmentation approaches. Pros: fundamental to enable cross-community exploitation of metadata. Cons: information loss; information inconsistency.
  • 8. 2. The case of 840,000 cc licensed data In Union Catalogue of Digital Archives Taiwan
  • 9. “Fitness for Use” is the Key: Data Quality (DQ) Definition for Digital Data
      • Nicholas R. Chrisman (1986): “Digital data can adapt to a broader range of uses with a broader range of special demand, … The root of data abuse is not in the quality of the data, but in the awareness and understanding of the quality of the data. By converting to the fitness for use approach, the problem of data abuse is moved from producer to consumer (data user).”
      • W3C Data Quality Vocabulary (2015): “...quality lies in the eye of the beholder; that there is no objective, ideal definition of it. Some datasets will be judged as low-quality resources by some data consumers, while they will perfectly fit others' needs.”
      • Quality from the perspectives of the supply and demand sides, e.g., Data Publishers, Certification Agencies, Data Aggregators and Data Consumers.
      • Quality is pragmatic, user-specific, and context-dependent.
  • 10. “Fitness for Use” in different contexts (figure):
      • Lifecycle: physical object → digital object → digital collection → digital aggregation & publication → reusing & semantic representation.
      • Steps: Creation, Conversion 1, Conversion 2, Conversion 3, Conversion 4, Clean & Enrich, Conversion 5.
      • Formats: TEXT/Image, XLSX/Table/HTML, XML/HTML, CSV, CSV, RDF/Turtle.
      • CONTEXT I, CONTEXT II, CONTEXT III: Local Curation (90 projects), Digital Archive Curation (1 portal), Linked Open Data (globally linked & semantically represented).
      • Notes in the figure: locally developed schemes; DC 15 elements as the requirement for the Union Catalog; globally linked, machine-accessible semantics and domain knowledge vocabularies are needed for LOD.
  • 11. Data quality along the same lifecycle (figure): physical object → digital object → digital collection → digital aggregation & publication → reusing & semantic representation.
      • Stages: Local Curation (90 projects), Digital Archive Curation (1 portal), Linked Open Data (globally linked & semantically represented). Data Quality is marked at each stage.
      • Activities: Metadata Generation; Provide metadata guidelines & standard (DC 15); Linked Data Generation.
      • Approaches: metadata evaluation approaches; semi-automatic metadata generation approaches; metadata cleaning, enhancement, augmentation approaches.
      • Open questions: Information loss? Interpretation problems? Time & resource cost?
  • 12. Problems identified in the case of 840,000 cc data 1. Confusion of Dublin Core (DC 定義混淆) 2. Name Ambiguity (名稱模糊) 3. Inconsistent Encoding (編碼不一致) 4. Semantic Overlaps (語意超載) 5. Duplicate Records (資料重複) 6. Insufficient Element Usage (語意缺漏) 7. Errors / Mistakes / Others (其它錯誤)
  • 13. Considerations in the case of 840,000 cc data for LOD
      1. We are not the data creators. Can we clean/revise the data “correctly”?
          • Keep the original CSV data open.
          • Publish revised/cleaned data as diff/mapping files.
      2. How can we prevent “information loss”?
          • Mapping activities often result in information loss.
          • Reconsider the value of broken links.
      3. Resources and time to handle the cleaning tasks are limited.
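The "keep the raw CSV open, publish corrections as a separate mapping file" idea from slide 13 can be sketched roughly as follows. This is a minimal illustration, not the project's actual tooling: the column names, the date values, the mapping-file layout, and the helper names `read_rows` and `apply_mapping` are all assumptions made for the example.

```python
import csv
import io

# Hypothetical raw export; column names and date values are invented for illustration.
RAW_CSV = """id,title,date
6277845,Parapercis kentingensis,2012/1/5
6277846,Unknown fish,Jan 2012
"""

# Hypothetical mapping file: one row per correction, stored apart from the raw data.
MAPPING_CSV = """id,field,old_value,new_value
6277845,date,2012/1/5,2012-01-05
6277846,date,Jan 2012,2012-01
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def apply_mapping(raw_rows, mapping_rows):
    """Return cleaned copies of raw_rows; the raw rows themselves are never modified."""
    fixes = {
        (m["id"], m["field"], m["old_value"]): m["new_value"]
        for m in mapping_rows
    }
    cleaned = []
    for row in raw_rows:
        new_row = dict(row)  # copy, so the original record stays intact
        for field, value in row.items():
            key = (row["id"], field, value)
            if key in fixes:
                new_row[field] = fixes[key]
        cleaned.append(new_row)
    return cleaned


raw = read_rows(RAW_CSV)
cleaned = apply_mapping(raw, read_rows(MAPPING_CSV))
```

Because each correction records the old value as well as the new one, a reader who disagrees with a fix can see exactly what was changed and fall back to the raw record.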
  • 14. 3. How Linked (Open Data) approach can improve data quality?
  • 15. 1. Raw data and new data (cleaned data, semantically refined data) can benefit from the Open Data approach:
      • Creation of new data based on combining data.
      • External quality checks of data (validation).
      • Sustainability of data (no data loss).
      • The ability to merge, integrate and mesh public and private data.
      Janssen, Marijn, Yannis Charalabidis, and Anneke Zuiderwijk. "Benefits, adoption barriers and myths of open data and open government." Information Systems Management 29.4 (2012): 258-268.
  • 16. 2. Using SPARQL queries to identify problems:
      • Identify DQ problems before RDF is generated: apply the W3C mapping language R2RML and the RDF validation framework RDFUnit to the mapping definitions, so that publishers can catch and correct violations before they even occur (Dimou et al., 2015).
      • Identify DQ problems after RDF is generated: use SPARQL and publicly shared LOD resources (e.g., DBpedia, GeoNames) as references to identify problems (Fürber and Hepp, 2010).
      Fürber, Christian, and Martin Hepp. "Using semantic web resources for data quality management." Knowledge Engineering and Management by the Masses. Springer Berlin Heidelberg, 2010. 211-225.
      Dimou, Anastasia, et al. "Assessing and Refining Mappings to RDF to Improve Dataset Quality." The Semantic Web-ISWC 2015. Springer International Publishing, 2015. 133-149.
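The logic of the "after RDF is generated" check on slide 16 can be illustrated without a triple store. The sketch below is plain Python, not SPARQL, and is only a stand-in for what a query against a reference dataset would do: it flags values that do not appear in a trusted reference list and items that lack the property altogether. The triples, the `ex:` identifiers, the small place list, and the function names are all hypothetical.

```python
# Hypothetical item triples as (subject, predicate, object); identifiers are placeholders.
TRIPLES = [
    ("ex:item/1", "dc:title", "Parapercis kentingensis"),
    ("ex:item/1", "dc:coverage", "Kenting"),
    ("ex:item/2", "dc:title", "Untitled"),
    ("ex:item/2", "dc:coverage", "Kentin"),  # misspelled on purpose
    ("ex:item/3", "dc:title", "No place given"),
]

# Hypothetical stand-in for a trusted reference list such as a gazetteer.
REFERENCE_PLACES = {"Kenting", "Taipei"}


def objects_of(triples, predicate):
    """Group the objects of one predicate by subject."""
    grouped = {}
    for s, p, o in triples:
        if p == predicate:
            grouped.setdefault(s, []).append(o)
    return grouped


def check_against_reference(triples, predicate, reference):
    """Return (unknown_values, missing_subjects) for one predicate."""
    subjects = {s for s, _, _ in triples}
    values = objects_of(triples, predicate)
    # Values present in the data but absent from the reference list.
    unknown = sorted(
        (s, o) for s, objs in values.items() for o in objs if o not in reference
    )
    # Subjects that never use the predicate at all.
    missing = sorted(subjects - set(values))
    return unknown, missing


unknown, missing = check_against_reference(TRIPLES, "dc:coverage", REFERENCE_PLACES)
```

In a real setup the same two questions would more likely be expressed as SPARQL queries over the published triples, with the reference list coming from a shared LOD resource rather than a hard-coded set.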
  • 17. 3. Use Vocabularies, Ontologies & LOD Knowledge Bases:
      • To improve data quality at every step of a dataset's lifecycle (e.g., W3C Data Quality Vocabulary).
      • To enrich data semantics and increase the value of reused and refined data.
      Five points summarized from Fürber and Hepp (2013), "Using Semantic Web Technologies for Data Quality Management." Handbook of Data Quality. Springer Berlin Heidelberg, 2013. 141-161:
      • Collaborative representation and use of quality-relevant knowledge.
      • Automatic identification of conflicting data requirements.
      • Semantic definition of data.
      • Use of Semantic Web data as trusted reference data.
      • Content integration with ontologies.
  • 18. http://www.w3.org/TR/vocab-dqv/ The importance of provenance and metadata quality. (Carata, Lucian, et al. 2014)
  • 19. The Story of A Fish http://catalog.digitalarchives.tw/item/00/5f/ca/d5.html Parapercis kentingensis
  • 20. http://URI of this Fish/6277845 (figure). Timeline: 2012, 2015, 2016. Formats: TEXT/Image, XLSX/Table/HTML, XML/HTML, CSV (raw data published as open data). Prefixes shown: prov, r4r, schema, cc, odw. Other labels in the figure: wikidata, err, time, new.
      (1) Metadata (DC 15): 12/15 triples (statements).
      (2) Provenance: 12/15 triples (statements) + one “diff” triple (new).
      (3) Mapping replaces cleaning: + one “time mapping” triple (new). At this stage, place information is described in Description rather than in Coverage. It should be cleaned and mapped to external resources like GeoNames and TaiwanPlaceName by us, or by others when time and resources are available.
      (4) Refined Version: semantically enriched by using domain vocabularies like Darwin Core Terms.
      (5) When the raw CSV and the DC 15 represented triples (DC 15 Version) are published, it is easy for others to detect errors and to reuse and enrich the data according to their own fitness for use and interpretations. Even if there are errors from the beginning, more statements about this Fish (6277845) can thus be generated by the interests of the community.
  • 21. How we clean data less through the Linked (Open Data) Approach
      1. Keep the original CSV data open.
      2. Clean less, map more: publish revised/cleaned data as diff/mapping files.
      3. Publish the original DC 15 statements as 15 triples and provide provenance information.
      4. Assign each item resource a URI.
      5. Use domain vocabularies to enrich the resource (e.g., dwc).
      6. Map and link to external databases to enrich statements (GeoNames, TaiwanPlaceNames, Encyclopedia of Life).
      7. More errors or meanings will be stated by third parties and crowdsourcing for their own interests.
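Steps 3 and 4 (one URI per item, the DC elements published as triples, plus a provenance statement) might look roughly like this when serialized as N-Triples. This is a sketch under stated assumptions: the base URI, the source-file URI, the sample row, and the function names are placeholders, and the choice of `prov:wasDerivedFrom` as the provenance property is an assumption rather than something the slides specify.

```python
DC_NS = "http://purl.org/dc/elements/1.1/"
PROV_DERIVED = "http://www.w3.org/ns/prov#wasDerivedFrom"

# Placeholder base URI and source file; neither comes from the slides.
BASE_URI = "http://example.org/item/"
SOURCE_CSV = "http://example.org/raw/catalog.csv"


def escape_literal(value):
    """Escape a string for use inside an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def row_to_ntriples(item_id, dc_row):
    """Turn one record (DC element name -> value) into N-Triples lines."""
    subject = f"<{BASE_URI}{item_id}>"
    lines = []
    for element, value in dc_row.items():
        if value:  # empty elements produce no statement
            lines.append(f'{subject} <{DC_NS}{element}> "{escape_literal(value)}" .')
    # One provenance statement pointing back at the untouched raw file.
    lines.append(f"{subject} <{PROV_DERIVED}> <{SOURCE_CSV}> .")
    return lines


# Hypothetical record: only three of the fifteen DC elements, one of them empty.
row = {
    "title": "Parapercis kentingensis",
    "type": "",
    "description": 'Found near "Kenting"',
}
triples = row_to_ntriples("6277845", row)
print("\n".join(triples))
```

Skipping empty elements is one way to end up with fewer than fifteen statements per item, as in the "12/15 triples" noted for the fish record; the values themselves are passed through uncleaned, so any corrections remain a separate, later layer.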
  • 22. References
      1. Batini, Carlo, et al. "Methodologies for data quality assessment and improvement." ACM Computing Surveys (CSUR) 41.3 (2009): 16.
      2. Chrisman, Nicholas R. "Obtaining information on quality of digital data." Proc. AutoCarto London. Vol. 1. 1986.
      3. Carata, Lucian, et al. "A primer on provenance." Communications of the ACM 57.5 (2014): 52-60.
      4. Dimou, Anastasia, et al. "Assessing and Refining Mappings to RDF to Improve Dataset Quality." The Semantic Web-ISWC 2015. Springer International Publishing, 2015. 133-149.
      5. Fürber, Christian, and Martin Hepp. "Using semantic web resources for data quality management." Knowledge Engineering and Management by the Masses. Springer Berlin Heidelberg, 2010. 211-225.
      6. Fürber, Christian, and Martin Hepp. "Using Semantic Web Technologies for Data Quality Management." Handbook of Data Quality. Springer Berlin Heidelberg, 2013. 141-161.
      7. Hooland, Seth van, and Ruben Verborgh. Linked data for libraries, archives and museums. (2014).
      8. Janssen, Marijn, Yannis Charalabidis, and Anneke Zuiderwijk. "Benefits, adoption barriers and myths of open data and open government." Information Systems Management 29.4 (2012): 258-268.
      9. Manus, Susan. The Value of a Broken Link (2012): http://blogs.loc.gov/digitalpreservation/2012/03/the-value-of-a-broken-link/
      10. Park, Jung-ran, and Eric Childress. "Dublin Core metadata semantics: An analysis of the perspectives of information professionals." Journal of Information Science 35.6 (2009): 727-739.
      11. Stvilia, Besiki, et al. "A framework for information quality assessment." Journal of the American Society for Information Science and Technology 58.12 (2007): 1720-1733.
      12. Tani, Alice, Leonardo Candela, and Donatella Castelli. "Dealing with metadata quality: The legacy of digital library efforts." Information Processing & Management 49.6 (2013): 1194-1205.
      13. W3C. Data Quality Vocabulary (2015): http://www.w3.org/TR/vocab-dqv/
      14. Yasser, Chuttur M. "An analysis of problems in metadata records." Journal of Library Metadata 11.2 (2011): 51-62.
      15. Zaveri, Amrapali, et al. "Quality assessment for linked open data: A survey." Semantic Web 7.1 (2016).
  • 23. Merry Christmas. Happy New Year. We will release the DC 15 Versions and the Refined Version (Biology) shortly.