This deck, presented at Connected Data London 2018 (7 Nov. 2018), looks at the past, present and possible future of schema.org.
In particular it examines to what degree schema.org has helped move us toward the web of linked data envisioned by Tim Berners-Lee, and what lessons can be learned from what, I argue, has been the successful launch of a collaboratively developed structured data vocabulary.
2. Aaron Bradley, Connected Data London 2018 ▪ #CDL2018 ▪ @aaranged
Electronic Arts
schema.org/worksFor
3. Aaron Bradley, Connected Data London 2018 ▪ #CDL2018 ▪ @aaranged
bit.ly/semsearch
schema.org
pending.schema.org/knowsAbout
bit.ly/sdataevents schema.org/WebSite
5. Aaron Bradley, Connected Data London 2018 ▪ #CDL2018 ▪ @aaranged
History and adoption
schema.org followed in the footsteps of other structured data initiatives, but appears to
have enjoyed much broader adoption
6. Aaron Bradley, Connected Data London 2018 ▪ #CDL2018 ▪ @aaranged
schema.org
Microformats (2004)
Broad search engine support
data-vocabulary.org (2009)
data-vocabulary.org
Open Graph Protocol (2007)
Partial search engine support
GoodRelations (2007)
DCMI Terms (2003)
FOAF (2000)
No explicit search engine support
Structured data existed prior to schema.org, but often with little or no search engine support
The road to schema.org
schema.org (2011)
7. Aaron Bradley, Connected Data London 2018 ▪ #CDL2018 ▪ @aaranged
A “collection of shared vocabularies … that can be understood by the major search engines”
schema.org in a nutshell
Structure
• A collection of schemas consisting of types, properties and
enumerations
• Types – classes and subclasses (e.g. “Book”)
• Properties – attributes expecting a value of a particular data type
(e.g. “sameAs”), or relations expecting an instance of a particular
type (e.g. “author”) or an enumeration member (e.g. “availability”)
• Enumerations – a class (e.g. “ItemAvailability”) whose members
are considered neither types nor properties (e.g. “InStock”)
Search engine support
• A joint initiative supported at launch by Bing, Google and
Yahoo, and soon after by Yandex
Supported encoding formats
• Microdata and RDFa supported at launch, with RDFa Lite and
JSON-LD support following
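A minimal Python sketch of how types, properties and enumerations fit together in practice. `Book`, `Person`, `Offer`, `author`, `sameAs`, `offers`, `availability` and `InStock` are schema.org terms; the book title, author name and example.org URL are hypothetical and used only for illustration.

```python
import json

# Hypothetical book record, used only for illustration.
book = {
    "@context": "http://schema.org",
    "@type": "Book",                      # a type (class)
    "name": "An Example Book",            # property expecting a data type (Text)
    "sameAs": "https://example.org/wiki/An_Example_Book",  # property expecting a URL
    "author": {                           # relation expecting an instance of a type
        "@type": "Person",
        "name": "Jane Example",
    },
    "offers": {
        "@type": "Offer",
        # property expecting a member of the ItemAvailability enumeration
        "availability": "http://schema.org/InStock",
    },
}


def to_jsonld(data):
    """Serialize the dictionary as a JSON-LD string."""
    return json.dumps(data, indent=2)


print(to_jsonld(book))
```

The same dictionary could just as well be expressed in Microdata or RDFa; JSON-LD is used here only because it maps directly onto a Python dictionary.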
8. Aaron Bradley, Connected Data London 2018 ▪ #CDL2018 ▪ @aaranged
All data from Web Data Commons
[Chart: Format Use as a Percentage of Sampled Domains. Series: RDFa, Microdata, JSON-LD. X-axis: Aug 2012, Nov 2013, Dec 2014, Nov 2015, Oct 2016, Nov 2017. Y-axis: 0% to 16%.]
Robust schema.org adoption data is hard to come by, but format use helps paint the picture
schema.org adoption as inferred from Web Data Commons data
9. Aaron Bradley, Connected Data London 2018 ▪ #CDL2018 ▪ @aaranged
What’s currently being encoded with these syntaxes is almost exclusively schema.org
For microdata and JSON-LD, it’s schema.org all the way down
Top Classes, Microdata, Nov. 2017 Top Classes, JSON-LD, Nov. 2017
10. Aaron Bradley, Connected Data London 2018 ▪ #CDL2018 ▪ @aaranged
All data from Web Data Commons
Format Use by Number of Domains in Sample
Raw Web Data Commons format usage data belies the relative expressiveness of schema.org
A relatively large vocabulary results in more assertions
2012 2017
11. Aaron Bradley, Connected Data London 2018 ▪ #CDL2018 ▪ @aaranged
Raw Web Data Commons format usage data belies the relative expressiveness of schema.org
A relatively large vocabulary results in more assertions
<span class="author vcard">
<a href="http://www.seoskeptic.com/aaron-bradley/" class="url fn">Aaron Bradley</a>
</span>
“... OGP (Open Graph Protocol) and
microformat approaches can be found on
approximately as many sites as Schema.org,
but given their much smaller vocabularies,
they appear on fewer than half as
many pages and contain fewer than a quarter
as many logical assertions.”
Guha, Brickley and Macbeth, Dec. 2015
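To make "logical assertions" concrete, here is a rough Python sketch that counts the statements in a JSON-LD-style dictionary. The counting rule (one assertion per property value, including `@type`) is a simplification chosen for illustration and is not the method used by Guha, Brickley and Macbeth; the two sample records are assumptions loosely based on the names that appear in this deck.

```python
def count_assertions(node):
    """Count statements in a JSON-LD-like dict (a simplified triple count)."""
    total = 0
    for key, value in node.items():
        if key in ("@context", "@id"):
            continue  # context and identifiers are not statements themselves
        values = value if isinstance(value, list) else [value]
        for item in values:
            total += 1  # one statement linking this node to the value
            if isinstance(item, dict):
                total += count_assertions(item)  # statements about the nested node
    return total


# Small vocabulary: roughly what a name-and-URL author card can convey.
small = {
    "@type": "Person",
    "name": "Aaron Bradley",
    "url": "http://www.seoskeptic.com/aaron-bradley/",
}

# Larger vocabulary: the same person, plus an employer relation.
rich = {
    "@type": "Person",
    "name": "Aaron Bradley",
    "url": "http://www.seoskeptic.com/aaron-bradley/",
    "worksFor": {"@type": "Organization", "name": "Electronic Arts"},
}

print(count_assertions(small), count_assertions(rich))  # prints "3 6"
```

A single extra relation doubles the count here, which is the effect the quote describes: the more a vocabulary lets a publisher say, the more assertions each page tends to carry.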
12. Aaron Bradley, Connected Data London 2018 ▪ #CDL2018 ▪ @aaranged
Such as they are
schema.org use by the numbers
Apr. 2014: 0.3% of domains (SearchMetrics, 500K domains; Microdata only?)
Dec. 2014: 22.0% of pages (Guha, Brickley, Macbeth, 10B pages)
Dec. 2015: 31.3% of pages (Guha, Brickley, Macbeth, 10B pages)
Nov. 2018: 21.9% of websites using JSON-LD, 15.6% using Microdata (W3Techs, top 10M websites per Alexa)
13. Aaron Bradley, Connected Data London 2018 ▪ #CDL2018 ▪ @aaranged
The path to adoption
The vocabulary launched with a clear value proposition for webmasters, and has been
buoyed since by a collaborative vocabulary development model, a modified extension
mechanism and the added flexibility afforded by JSON-LD
14. Aaron Bradley, Connected Data London 2018 ▪ #CDL2018 ▪ @aaranged
Event
Recipe, AggregateRating
Product, AggregateRating
The search engines incentivized schema.org use right out of the gate with rich snippets
Rich results at launch
15. Aaron Bradley, Connected Data London 2018 ▪ #CDL2018 ▪ @aaranged
The search engines have been steadily adding new search features as the vocabulary grows
Rich results post-launch
Organization.logo, Organization.sameAs JobPosting ClaimReview
16. Aaron Bradley, Connected Data London 2018 ▪ #CDL2018 ▪ @aaranged
23 March 2017, 4 May 2016
[Chart: Classes in schema.org, 2011-2018. Bars: Jun-11, Nov-15, Nov-18. Series: Core, Extensions, Pending. Axis: 0 to 1000.]
A living vocabulary
Over the course of time schema.org has become more and more expressive
17. Aaron Bradley, Connected Data London 2018 ▪ #CDL2018 ▪ @aaranged
public-schemaorg W3C Mailing List
schema.org provides multiple mechanisms for collaborative vocabulary development
Making vocabulary development a community affair
schema.org on Github Partnerships
18. Aaron Bradley, Connected Data London 2018 ▪ #CDL2018 ▪ @aaranged
GS1’s SmartSearch is powered by a schema.org
external extension
schema.org’s extension mechanism was completely revamped in v2.0 (May 2015)
Extending schema.org with more specialized vocabulary
SmartSearch in action at Tesco
19. Aaron Bradley, Connected Data London 2018 ▪ #CDL2018 ▪ @aaranged
schema.org endorsed JSON-LD in 2013; Google started using it in 2014, with full support by 2016
JSON-LD: developer-friendly linked data
“…the whole point about it is, it is JSON first and RDF
second. And the fact that it carries RDF is simply
unimportant. And it's particularly unimportant to people
who are JSON users – which is basically every web
developer these days.
“People don't need to know everything, they can create
really cool applications, and if they find JSON-LD useful
– fantastic. If they don't know that it's RDF, I don't care.”
Phil Archer, Aug. 2014
20. Aaron Bradley, Connected Data London 2018 ▪ #CDL2018 ▪ @aaranged
Separation of the data and presentation layers makes life considerably easier for web developers
JSON-LD versus inline markup: no contest
Product Details Page: Before Product Details Page: After
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "Product",
"name": "Bob's Best Basic T",
"image": "bbbt-pink.jpg",
"offers": {
"@type": "Offer",
"price": "28",
"priceCurrency": "USD"
},
"aggregateRating": {
…
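The separation of data and presentation can be sketched in a few lines of Python. The function names `jsonld_script` and `render_page` are hypothetical, as is the visible HTML; the product values echo the slide's example, with the price given as a bare number and the currency as an ISO 4217 code, which is what schema.org's `price` and `priceCurrency` expect.

```python
import json


def jsonld_script(data):
    """Wrap a dictionary in a JSON-LD script element."""
    # Escape "</" so a string value cannot close the script element early.
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + payload + "\n</script>"


def render_page(visible_html, data):
    """Hypothetical template step: markup and data are composed independently."""
    return visible_html + "\n" + jsonld_script(data)


product = {
    "@context": "http://schema.org",
    "@type": "Product",
    "name": "Bob's Best Basic T",
    "image": "bbbt-pink.jpg",
    "offers": {"@type": "Offer", "price": "28", "priceCurrency": "USD"},
}

print(render_page("<h1>Bob's Best Basic T</h1>", product))
```

Because the structured data lives in its own block, the visible template can be redesigned without touching the markup, and vice versa. With inline Microdata or RDFa the two would have to change together.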
21. Aaron Bradley, Connected Data London 2018 ▪ #CDL2018 ▪ @aaranged
schema.org beyond search
Seemingly striking the right balance between expressiveness and complexity, the
vocabulary is being used for applications outside of search, and is increasingly the
starting point for ground-up linked data initiatives
22. Aaron Bradley, Connected Data London 2018 ▪ #CDL2018 ▪ @aaranged
Pinterest uses schema.org to populate Article, Product and Recipe Rich Pins
Leveraging structured data to enhance the presentation layer
Pinterest Product Rich Pin Offer Information on Pin Source Page
23. Aaron Bradley, Connected Data London 2018 ▪ #CDL2018 ▪ @aaranged
When Google needed vocabulary for its Assistant it unsurprisingly turned to schema.org
Virtual assistants and schema.org
25. Aaron Bradley, Connected Data London 2018 ▪ #CDL2018 ▪ @aaranged
Amazon’s Alexa Meaning Representation Language is based on schema.org
Virtual assistants and schema.org
“The Alexa ontology utilized schema.org as
its base and has been updated to include
support for spoken language. In addition,
using schema.org as the base of the Alexa
Ontology means that it shares a vocabulary
used by more than 10 million websites, which
can be linked to the Alexa ontology”
Thomas Kollar et al., Jun. 2018
26. Aaron Bradley, Connected Data London 2018 ▪ #CDL2018 ▪ @aaranged
A New Zealand health insurance company used the vocabulary to kickstart product development
Bootstrapping development with schema.org
David Gibson, Feb. 2018
27. Aaron Bradley, Connected Data London 2018 ▪ #CDL2018 ▪ @aaranged
The vocabulary allows linked data practitioners to construct knowledge graphs with relative ease
Bootstrapping development with schema.org
“…the knowledge graph is implemented as a
triple store where the data has been
represented using a small number of
vocabularies (mostly schema.org with some
terms borrowed from TAXREF-LD and the
TDWG LSID vocabularies).”
Rod Page, Ozymandias
28. Aaron Bradley, Connected Data London 2018 ▪ #CDL2018 ▪ @aaranged
Chinese search engine Baidu appears to have based its knowledge graph on schema.org
Bootstrapping development with schema.org
Via Google Translate
29. Aaron Bradley, Connected Data London 2018 ▪ #CDL2018 ▪ @aaranged
Electronic Arts used the vocabulary as the basis for their domain ontology
Bootstrapping development with schema.org
30. Aaron Bradley, Connected Data London 2018 ▪ #CDL2018 ▪ @aaranged
Boundaries of the vocabulary
As schema.org is adopted for use in increasingly diverse domains, there are more and
more demands to add to the vocabulary: does it risk becoming too much of “an ontology of
everything”, or is it actually not expressive enough?
31. Aaron Bradley, Connected Data London 2018 ▪ #CDL2018 ▪ @aaranged
Just how much can we say about each entity?
Let’s play 20 questions using schema.org vocabulary!
Is it an animal? It’s a Thing. More expressive exceptions: Person, Product
Is it a vegetable? It’s a Thing. More expressive exception: Product
Is it a mineral? It’s a Thing. More expressive exception: Product
32. Aaron Bradley, Connected Data London 2018 ▪ #CDL2018 ▪ @aaranged
But there’s always a tension between adding to schema.org and referencing existing vocabularies
The “add animals and plants” discussion has recently reignited
34. Aaron Bradley, Connected Data London 2018 ▪ #CDL2018 ▪ @aaranged
Recent developments and future
directions
At the same time that the improved ability of machines to understand content makes
structured data use less of an imperative, schema.org is increasingly proving useful
as a mechanism for serialized linked data
35. Aaron Bradley, Connected Data London 2018 ▪ #CDL2018 ▪ @aaranged
If machines are eventually able to parse content like humans, will structured data still be necessary?
Will AI and related technologies render schema.org obsolete?
36. Aaron Bradley, Connected Data London 2018 ▪ #CDL2018 ▪ @aaranged
Leveraging schema.org allows Google to improve the discoverability of datasets
Bridging the semantic gap with Dataset Search
Year of Birth No. of cases
1976 1
1977 1
1980 1
1981 2
1982 7
1983 8
1984 7
1985 7
1986 11
…
Total 89
37. Aaron Bradley, Connected Data London 2018 ▪ #CDL2018 ▪ @aaranged
JSON-LD data feeds enable publishers to support user-initiated video or audio playback
Bridging the action gap with Google Media Actions
<script type="application/ld+json">
{
"@context": ["http://schema.org",
{"@language": "en"}],
"@type": "Movie",
"@id": "http://example.com/M",
"url": "http://example.com/M",
"name": "M",
"potentialAction": {
"@type": "WatchAction",
"target": {
"@type": "EntryPoint",
"urlTemplate":
"http://example.com/M?autoplay=true",
"inLanguage": "en",
"actionPlatform": [
"http://schema.org/DesktopWebPlatform",
"http://schema.org/MobileWebPlatform",
"http://schema.org/AndroidPlatform",
"http://schema.org/IOSPlatform",
"http://schema.googleapis.com/GoogleVideoCast"
]
…
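As a rough illustration of how a consumer might read such a feed, the Python sketch below pulls the playback URL and supported platforms out of each `WatchAction`. The function name `watch_targets` is hypothetical and this is not a description of how Google processes Media Actions feeds; the sample record is a trimmed copy of the slide's example.

```python
def watch_targets(entity):
    """Return (urlTemplate, platforms) pairs for every WatchAction on an entity."""
    actions = entity.get("potentialAction", [])
    if isinstance(actions, dict):
        actions = [actions]  # JSON-LD allows a single object or a list
    results = []
    for action in actions:
        if action.get("@type") != "WatchAction":
            continue
        targets = action.get("target", [])
        if isinstance(targets, dict):
            targets = [targets]
        for target in targets:
            results.append(
                (target.get("urlTemplate"), target.get("actionPlatform", []))
            )
    return results


movie = {
    "@type": "Movie",
    "name": "M",
    "potentialAction": {
        "@type": "WatchAction",
        "target": {
            "@type": "EntryPoint",
            "urlTemplate": "http://example.com/M?autoplay=true",
            "actionPlatform": [
                "http://schema.org/DesktopWebPlatform",
                "http://schema.org/MobileWebPlatform",
            ],
        },
    },
}

for url, platforms in watch_targets(movie):
    print(url, len(platforms))
```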
38. Aaron Bradley, Connected Data London 2018 ▪ #CDL2018 ▪ @aaranged
This Google tool supports direct entry of ClaimReview data, which then appears on dataCommons.org
Bridging the markup gap with the Fact Check Markup Tool
...
"@type" : "DataFeedItem",
"dateModified" : "2018-10-24T15:00:14.238315+00:00",
"item" :
[
{
"@context" : "schema.org",
"@type" : "ClaimReview",
"author" :
{
"@type" : "Organization",
"name" : "Sens3",
"url" : "http://fct.sens3.com/"
},
"claimReviewed" : "I play the trumpet!",
"datePublished" : "2018-10-09",
"itemReviewed" :
{
"@type" : "Claim",
"author" :
{
"@type" : "Person",
"name" : "Paul McCartney"
}
},
"reviewRating" :
...
"@type": "Rating",
"ratingValue": "2",
"alternateName" : "Mostly False",
"bestRating": "5",
"worstRating": "1"
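A consumer of ClaimReview data needs the rating's scale to interpret its value. The Python sketch below places a `Rating` on a 0 to 1 scale; the function name `normalized_rating` is hypothetical, and the fallback values of 1 and 5 follow schema.org's stated defaults for `worstRating` and `bestRating` when they are omitted.

```python
def normalized_rating(rating):
    """Map a schema.org Rating onto a 0-1 scale (0 = worst, 1 = best)."""
    worst = float(rating.get("worstRating", 1))  # schema.org assumes 1 if omitted
    best = float(rating.get("bestRating", 5))    # schema.org assumes 5 if omitted
    value = float(rating["ratingValue"])
    if best <= worst:
        raise ValueError("bestRating must be greater than worstRating")
    if not worst <= value <= best:
        raise ValueError("ratingValue lies outside the declared scale")
    return (value - worst) / (best - worst)


review_rating = {
    "@type": "Rating",
    "ratingValue": "2",
    "alternateName": "Mostly False",
    "bestRating": "5",
    "worstRating": "1",
}

print(normalized_rating(review_rating))  # prints 0.25
```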
40. Aaron Bradley, Connected Data London 2018 ▪ #CDL2018 ▪ @aaranged
schema.org has established common ground on shared terminology: is it time to address identifiers?
Questions of identity
“Very early in the formation of schema.org we made a strong decision, which was not
to support canonical IDs, and I think it was an important thing because it would have
been very politically contentious at the time to support it, because we basically would
have had to pick somebody's ID system to have canonical IDs.
“I think the time has come for canonical IDs, so I would love to see schema.org or
some other organization take on canonical IDs.”
Steve Macbeth, Microsoft, Apr. 2018
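In the absence of canonical IDs, one possible workaround is to treat two records as describing the same entity when they share an `@id` or `sameAs` URL. The Python sketch below illustrates that idea only; the function names and the example.org identifier are hypothetical, and real reconciliation would need to be considerably more careful.

```python
def identifiers(record):
    """Collect every URL a record uses to identify its subject."""
    ids = set()
    if "@id" in record:
        ids.add(record["@id"])
    same_as = record.get("sameAs", [])
    if isinstance(same_as, str):
        same_as = [same_as]  # sameAs may hold one URL or several
    ids.update(same_as)
    return ids


def probably_same_entity(a, b):
    """True when two records share at least one identifying URL."""
    return bool(identifiers(a) & identifiers(b))


# Hypothetical records from three different publishers.
record_a = {
    "@type": "Person",
    "name": "Paul McCartney",
    "sameAs": "https://example.org/id/paul-mccartney",
}
record_b = {
    "@type": "Person",
    "name": "Sir Paul McCartney",
    "@id": "https://example.org/id/paul-mccartney",
}
record_c = {"@type": "Person", "name": "Paul McCartney"}

print(probably_same_entity(record_a, record_b))  # True
print(probably_same_entity(record_a, record_c))  # False
```

The third record shows the gap: a matching name alone gives a consumer nothing dependable to join on, which is the problem a shared identifier scheme would address.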