Analysing Structured Scholarly Data Embedded in Web Pages

Analysing Structured Scholarly Data
Embedded in Web Pages
Pracheta Sahoo, Ujwal Gadiraju, Ran Yu,
Sriparna Saha and Stefan Dietze
WWW 2016
April 11th
, 2016
Montreal, Canada

OVERVIEW
❏ INTRODUCTION
❏ MOTIVATION
❏ RESEARCH
QUESTIONS
❏ ANALYSES
❏ CONCLUSIONS
❏ FUTURE WORK

INTRODUCTION (1/3)
The Web: nearly 46 trillion
Web pages indexed by Google
VS
Linked Data: approx. 1000
datasets & 100 billion
statements
● different order of
magnitude w.r.t. scale &
dynamics
Are there other semantics (structured facts) on the Web?

INTRODUCTION (2/3)
● Web pages embed structured data
(microdata, microformats and RDFa)
○ Interpretation of web documents
(search & retrieval)
● Increase in prevalence of embedded
markup (2014 Google study of 12 bn
pages estimates an adoption of 26%)
● “Web Data Commons” (Meusel et al.
[ISWC’14])
○ Markup from Common Crawl (2.2 bn
pages)
○ 17 billion RDF quads
○ Markup in 26% of pages, 14% of PLDs
in 2013 (increase from 6% in 2011)

Other semantics
(structured facts) on
the Web!

INTRODUCTION (3/3)
Characteristics of Markup Data

MOTIVATION
● Embedded markup ⇒ sparsely
linked, large % of coreferences,
redundant statements
● Uptake and reuse of embedded
markup is hindered by the lack
of dynamics, scale
● Lack of understanding of the
adoption of markup for
scholarly resource metadata

WHAT WE BRING TO THE TABLE ...
● Study of scholarly data
extracted from embedded
annotations (Web Data
Commons)
● Shape & characteristics of
entity descriptions
● Level of adoption of terms
& types, distributions
across TLDs, PLDs, data
publishers

RESEARCH QUESTIONS
RQ1 What are frequently used
terms & types for scholarly data?
RQ2 How are statements about
bibliographic data distributed
across the web? Who are the key
providers of bibliographic markup?
RQ3 What are the frequent errors
that can be observed?

DATASET
● Web Data Commons (WDC) 2014 dataset
● Subset ⇒ all statements describing entities
of type s:ScholarlyArticle or co-
occuring on same document with any s:
ScholarlyArticle instance
○ 6,793,764 quads
○ 1,184,623 entities
○ 83 distinct classes
○ 429 distinct predicates

DATASET - Considerations
● s:ScholarlyArticle is the only type which
explicitly refers to scholarly articles
● We focus on schema.org, the most
widely used schema
● Types considered ⇒ s:ScholarlyArticle,
s:Person and s:Organization
○ 280,616 instances (s:
ScholarlyArticle)
○ 847,417 insrances (s:Person)
○ 3,798 instances (s:Organization)

SCHOLARLY TYPES & PREDICATES (½)
Cumulative dist. of predicates over instances across
extracted types
1 to 14
1 to 9 1 to 4

SCHOLARLY TYPES & PREDICATES (2/2)
Top-10 Predicates for s:ScholarlyArticle

DOMAINS & DOCUMENTS (1/5)
Distribution of Entities & Statements across PLDs

Top-10 PLDs (ranked by no. of entities)

Distribution of Entities & Statements across TLDs

Distribution of Entities & Statements across HTML
Documents

Top-10 Documents Ranked According to
Embedded Entities

TOPICS & PUBLICATION TYPES (1/4)
Distribution of Scholarly Articles across Publishers

Top-10 Publishers and corresponding no. of
Publications

Top-10 Publication Types (genres) across WDC

Top-10 Article Titles (ranked by frequency of occurrence)

FREQUENT ERRORS - Schema Violations
Top-10 Misused Predicates

CONCLUSIONS (½)
● First study on coverage & char. of
bibliographic metadata embedded
in web pages.
● Early adopters ⇒ publishers,
libraries, other providers of
bibliographic data.
● Usage of terms, types ⇒ dist.
across providers, domains and
topics follows a power law; few
providers & documents
contributing to majority of data.

● Top-k genres & publishers indicate a
bias towards French, English data
providers.
● Article titles, PLDs & publishers ⇒
bias Computer Science and Life
Sciences.
● In this study we only consider entities
tagged explicitly as "scholarlyArticle",
a deeper analysis considering more
types (article, book, etc.) and other
creative works can shed light on the
true scale of and potential of
embedded markup data.
CONCLUSIONS (2/2)

FUTURE WORK
● Targeted crawl of typical
providers of scholarly data
(publishers, academic
orgs., libraries, etc.)
● Consider implicitly typed
bibliographic or creative
work as scholarly data

Contact Details :
gadiraju@l3s.de
http://www.L3S.de

LIMITATIONS
● Our study is limited to
schema.org & the types of
s:ScholarlyArticle, s:
Person, s:Organization.
● We consider only explicitly
linked scholarly works.

Analysing Structured Scholarly Data Embedded in Web Pages

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to Analysing Structured Scholarly Data Embedded in Web Pages

Similar to Analysing Structured Scholarly Data Embedded in Web Pages (20)

Recently uploaded

Recently uploaded (20)

Analysing Structured Scholarly Data Embedded in Web Pages