Web pages increasingly embed structured data in the form of
microdata, microformats and RDFa. Through efforts such as schema.org, such embedded markup have become prevalent, with current studies estimating an adoption by about 26% of all web pages. Similar to the early adoption of Linked Data principles by publishers, libraries and other providers of bibliographic data, such organisations have been among the
early adopters, providing an unprecedented source of structured data about scholarly works. Such data, however, is fundamentally different
from traditional Linked Data, by being very sparsely linked and consisting of a large amount of coreferences and redundant statements. So far, the scale and nature of embedded scholarly data on the Web has not been investigated. In this work, we provide a study on embedded scholarly data to answer research questions about the depth, syntactic and semantic characteristics and distribution of extracted data, thereby investigating challenges and opportunities for using embedded data as a structured knowledge graph of scholarly information.
Analysing Structured Scholarly Data Embedded in Web Pages
1. Analysing Structured Scholarly Data
Embedded in Web Pages
Pracheta Sahoo, Ujwal Gadiraju, Ran Yu,
Sriparna Saha and Stefan Dietze
WWW 2016
April 11th
, 2016
Montreal, Canada
3. INTRODUCTION (1/3)
The Web: nearly 46 trillion
Web pages indexed by Google
VS
Linked Data: approx. 1000
datasets & 100 billion
statements
● different order of
magnitude w.r.t. scale &
dynamics
Are there other semantics (structured facts) on the Web?
4. INTRODUCTION (2/3)
● Web pages embed structured data
(microdata, microformats and RDFa)
○ Interpretation of web documents
(search & retrieval)
● Increase in prevalence of embedded
markup (2014 Google study of 12 bn
pages estimates an adoption of 26%)
● “Web Data Commons” (Meusel et al.
[ISWC’14])
○ Markup from Common Crawl (2.2 bn
pages)
○ 17 billion RDF quads
○ Markup in 26% of pages, 14% of PLDs
in 2013 (increase from 6% in 2011)
7. MOTIVATION
● Embedded markup ⇒ sparsely
linked, large % of coreferences,
redundant statements
● Uptake and reuse of embedded
markup is hindered by the lack
of dynamics, scale
● Lack of understanding of the
adoption of markup for
scholarly resource metadata
8. WHAT WE BRING TO THE TABLE ...
● Study of scholarly data
extracted from embedded
annotations (Web Data
Commons)
● Shape & characteristics of
entity descriptions
● Level of adoption of terms
& types, distributions
across TLDs, PLDs, data
publishers
9. RESEARCH QUESTIONS
RQ1 What are frequently used
terms & types for scholarly data?
RQ2 How are statements about
bibliographic data distributed
across the web? Who are the key
providers of bibliographic markup?
RQ3 What are the frequent errors
that can be observed?
10. DATASET
● Web Data Commons (WDC) 2014 dataset
● Subset ⇒ all statements describing entities
of type s:ScholarlyArticle or co-
occuring on same document with any s:
ScholarlyArticle instance
○ 6,793,764 quads
○ 1,184,623 entities
○ 83 distinct classes
○ 429 distinct predicates
11. DATASET - Considerations
● s:ScholarlyArticle is the only type which
explicitly refers to scholarly articles
● We focus on schema.org, the most
widely used schema
● Types considered ⇒ s:ScholarlyArticle,
s:Person and s:Organization
○ 280,616 instances (s:
ScholarlyArticle)
○ 847,417 insrances (s:Person)
○ 3,798 instances (s:Organization)
12. SCHOLARLY TYPES & PREDICATES (½)
Cumulative dist. of predicates over instances across
extracted types
1 to 14
1 to 9 1 to 4
13. SCHOLARLY TYPES & PREDICATES (2/2)
Top-10 Predicates for s:ScholarlyArticle
24. CONCLUSIONS (½)
● First study on coverage & char. of
bibliographic metadata embedded
in web pages.
● Early adopters ⇒ publishers,
libraries, other providers of
bibliographic data.
● Usage of terms, types ⇒ dist.
across providers, domains and
topics follows a power law; few
providers & documents
contributing to majority of data.
25. ● Top-k genres & publishers indicate a
bias towards French, English data
providers.
● Article titles, PLDs & publishers ⇒
bias Computer Science and Life
Sciences.
● In this study we only consider entities
tagged explicitly as "scholarlyArticle",
a deeper analysis considering more
types (article, book, etc.) and other
creative works can shed light on the
true scale of and potential of
embedded markup data.
CONCLUSIONS (2/2)
26. FUTURE WORK
● Targeted crawl of typical
providers of scholarly data
(publishers, academic
orgs., libraries, etc.)
● Consider implicitly typed
bibliographic or creative
work as scholarly data
28. LIMITATIONS
● Our study is limited to
schema.org & the types of
s:ScholarlyArticle, s:
Person, s:Organization.
● We consider only explicitly
linked scholarly works.