1. Scaling the (evolving) web data
- at low cost -
Javier D. Fernández
QuWeDa 2017: Querying the Web of Data
Kosice, 29/05/2017
2. A good recipe for a WS Keynote
Ingredients (for approx. 20 persons)
A motivated speaker
Some knowledge in the area
An engaged audience
Slides (number at your convenience)
Method
Present yourself
Set the context, give an overall picture of the area
Touch some of the topics of the event
Focus the discussion: sell your work
Devise future developments in the area
Mix everything with jokes
3. About me:
Since 2015 at WU (Vienna University of Economics and Business), Institute for Information Business
Research interests: Semantic Web, Open Data, Big (Semantic) Data Management,
Databases, Data Compression, Privacy and Security
https://www.wu.ac.at/en/infobiz/team/fernandez/
Madrid, Valladolid, Santiago, Rome
Óscar Corcho, Pablo de la Fuente, Miguel A. Martínez-Prieto, Claudio Gutiérrez, Maurizio Lenzerini
Vienna
Axel Polleres
6. The Web of Data Ecosystem
First, we had better know what we can offer…
What is the Semantic Web/Web of Data/Linked Data?
Who are we? What have we done so far?
What haven't we done so far?
[Diagram labels: Linked Data, Semantic Web, Open Data, Big Data]
7. (Big Semantic Data: Linked Data vs. Big Data)
Overlaps:
LD as a whole is big (38B-150B triples)
No rigid (e.g., relational) data model
Big Data technologies (e.g., Hadoop) are used to handle LD
LD can represent knowledge extracted from big unstructured
data (especially to deal with variety)
Key Differences:
Individual linked data sets are typically not "big" per se
(e.g., the English DBpedia dump (zip) is currently < 5 GB)
LD is structured with a single data model (RDF); "big data lakes" are
typically neither
Big Data is based on distributed data infrastructures within an
organization (e.g., Hadoop clusters); LD creates a
decentralized, globally distributed data infrastructure
8. Let’s study the community…
Survey practitioner needs, technological challenges, and
open research questions on the use of Linked Data
Austrian FFG ICT of the Future project (exploratory study)
Consortium: IDC Austria, TU Wien (Vienna University of Technology),
Vienna University of Economics and Business (WU), Semantic Web Company
Project ended in Dec 2016: https://www.linked-data.at/
Inputs: Standards*, Requirements, Literature research*
* Special kudos to Sabrina Kirrane and Axel Polleres for the community analysis
9. Interviews
23 interviews:
Domains
Consulting, Engineering, Environment, Finance and Insurance,
Government, Healthcare, ICT, IT, Media, Pharmaceutical,
Professional Services, Real Estate, Research, Startup, Tourism,
Transports & Logistics
Roles
Business Intelligence, CEO, Chief Engineer, Data and Systems
Architect, Data Scientist, Director Information Management,
Enterprise Architect, Founder, General Secretary, Governance, Risk
& Compliance Manager, Head of Communications and Media, Head
of Development, Head of HR, Head of R&D, Innovation Manager,
Information Architect, IT Project Manager, Management, Managing
Director, Marketing Analyst, Principal System Analyst, Project
Coordinator, Researcher, Technical Specialist
10. Technologies in need…
Analytics
Computational linguistics & NLP
Concept tagging & annotation
Data integration
Data management
Dynamic data / streaming
Extraction, data mining, text mining, entity extraction
Logic, formal languages & reasoning
Human-Computer Interaction & visualization
Knowledge representation
Machine learning
Ontology/thesaurus/taxonomy management
Quality & Provenance
Recommendation
Robustness, scalability, optimization and performance
Searching, browsing & exploration
Security and privacy
System engineering
We ended up with most areas of the Semantic Web
21. Semantic Web/Linked Data over time…
Early adopters:
MITRE
Chevron
British Telecom
Boeing
Ordnance Survey
Eli Lilly
Pfizer
Agfa
Food and Drug Administration
National Institutes of Health
Software adopters/products:
Oracle
Adobe
Altova
OpenLink
TopQuadrant
Software AG
Aduna Software
Protégé
SAPHIRE
24. LD Adopters - Companies
[Bar chart: conference sponsors that appear in papers, 2006-2015. X axis: Companies (Google, Oracle, Yahoo, SAP, IEEE Intelligent Systems, Franz, Bing, Expert System, IBM Research, Poolparty). Y axis: Occurrences (0 to 1600).]
26. Semantic Web/Linked Data over time…
The authors claim that "early research has transitioned into these larger, more applied systems, today's Semantic Web research is changing: It builds on the earlier foundations but it has generated a more diverse set of pursuits".
31. Motivation
Publication, Exchange and Consumption of large RDF datasets
Most RDF formats (N3, RDF/XML, Turtle) are text serializations, designed for
human readability (not for machines)
Verbose = high costs to write/exchange/parse
A basic offline search = (decompress) + index the file + search
Lightweight Binary RDF (HDT)
Highly compact serialization of RDF
Allows fast RDF retrieval in compressed space (without prior decompression)
Includes internal indexes to solve basic queries with a small (3%) memory footprint.
Very fast on basic queries (triple patterns): about 1.5x faster than Virtuoso and RDF-3X.
Complex queries (joins) perform on the same scale as current solutions (Virtuoso, RDF-3X).
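As a rough illustration of why a binary encoding can be compact and still directly queryable, the sketch below maps each RDF term to an integer ID and answers triple patterns over the ID triples. It is a toy under stated assumptions, not HDT's actual format or API: the class `TinyStore` and its methods are hypothetical names, and the linear scan stands in for the real internal indexes.

```python
from typing import Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]


class TinyStore:
    """Toy dictionary-encoded triple store (illustrative only, not HDT itself)."""

    def __init__(self, triples: List[Triple]) -> None:
        # Assumption: one shared dictionary; every distinct term is stored once.
        terms = sorted({term for triple in triples for term in triple})
        self.term_to_id = {term: i for i, term in enumerate(terms)}
        self.id_to_term = terms
        # Triples become sorted tuples of integers instead of repeated strings.
        self.ids = sorted(
            {tuple(self.term_to_id[term] for term in triple) for triple in triples}
        )

    def search(self, s: Optional[str] = None, p: Optional[str] = None,
               o: Optional[str] = None) -> Iterator[Triple]:
        """Yield triples matching a triple pattern; None acts as a wildcard."""
        pattern = []
        for term in (s, p, o):
            if term is None:
                pattern.append(None)
            elif term in self.term_to_id:
                pattern.append(self.term_to_id[term])
            else:
                return  # a bound term missing from the dictionary matches nothing
        # Linear scan for simplicity; a real store would use an index here.
        for ids in self.ids:
            if all(want is None or want == got for want, got in zip(pattern, ids)):
                yield tuple(self.id_to_term[i] for i in ids)
```

The point of the design is that matching happens on integers, so the strings are only touched when results are decoded.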
DBpedia (431 M triples): ~63 GB
NT + gzip: 5 GB
HDT: 6.6 GB
HDT + gzip: 2.7 GB
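The size figures above can be turned into reduction factors with a quick calculation. This sketch assumes that the ~63 GB figure is the uncompressed baseline, which the slide does not state explicitly; the constant and function names are hypothetical.

```python
from typing import Dict

# Sizes in GB as reported on the slide for DBpedia (431 M triples).
BASELINE_GB = 63.0  # assumption: size of the uncompressed dump
SIZES_GB: Dict[str, float] = {"NT + gzip": 5.0, "HDT": 6.6, "HDT + gzip": 2.7}


def reduction_factors(baseline: float, sizes: Dict[str, float]) -> Dict[str, float]:
    """Return how many times smaller each representation is than the baseline."""
    return {name: baseline / size for name, size in sizes.items()}


if __name__ == "__main__":
    for name, factor in reduction_factors(BASELINE_GB, SIZES_GB).items():
        print(f"{name}: {factor:.1f}x smaller than the baseline")
```

Note that plain HDT is slightly larger than the gzipped N-Triples file but, per the slide, can be queried without prior decompression.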
rdfhdt.org
35. Applications
Compress and share ready-to-consume RDF datasets
Transfer large data between servers
Embedded Systems & Phones
Fast, low-cost SPARQL query engine
Via LDF (Linked Data Fragments)
HDT-Jena
HDT-ClioPatria
36. But what about Web-scale queries?
E.g. retrieve all entities in LOD with the label "Tim Berners-Lee"
Options:
Crawl and index LOD locally (-no-)
Follow-your-nose (where should I start?)
Federated querying (as good as the endpoints you query)
Use LOD Laundromat as a “good approximation” (still
querying 650K datasets)
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?x WHERE {
  ?x rdfs:label "Tim Berners-Lee"
}
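To make the cost of the options concrete, the query above can be read as "find every subject that carries a given rdfs:label, in any dataset". A minimal sketch, assuming each dataset is simply an in-memory list of triples (a stand-in for whatever one would crawl, federate over, or fetch from LOD Laundromat); `subjects_with_label` is a hypothetical helper.

```python
from typing import Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def subjects_with_label(datasets: Iterable[List[Triple]], label: str) -> Set[str]:
    """Union of the distinct subjects that carry the given rdfs:label."""
    found: Set[str] = set()
    for triples in datasets:  # one pass per dataset: cost grows with their number
        for s, p, o in triples:
            if p == RDFS_LABEL and o == label:
                found.add(s)
    return found
```

Every additional dataset adds another pass, which is why touching hundreds of thousands of datasets per query is expensive.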
43. LOD-a-lot (some use cases)
Query resolution at Web scale
Evaluation and Benchmarking
No excuse
RDF metrics and analytics
[Figure labels: subjects, predicates, objects]
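As an example of the kind of RDF metrics mentioned here, the sketch below counts distinct triples, subjects, predicates and objects. It works on an in-memory collection purely for illustration; `term_counts` is a hypothetical helper and not part of any LOD-a-lot tooling.

```python
from typing import Dict, Iterable, Tuple

Triple = Tuple[str, str, str]


def term_counts(triples: Iterable[Triple]) -> Dict[str, int]:
    """Count distinct triples, subjects, predicates and objects."""
    unique = set(triples)  # duplicates across merged datasets count once
    return {
        "triples": len(unique),
        "subjects": len({s for s, _, _ in unique}),
        "predicates": len({p for _, p, _ in unique}),
        "objects": len({o for _, _, o in unique}),
    }
```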
49. 1) Linked Open/Closed Data
B) A secure LD Endpoint
ESWC’17, THU 16:30-17:00 - Self-Enforcing Access Control for Encrypted RDF.
Javier D. Fernández, Sabrina Kirrane, Axel Polleres and Simon Steyskal
50. 2) RDF evolution at Scale
[Figure. Source: Andreas Harth, "Stream Reasoning in Mixed Reality Applications", Stream Reasoning Workshop 2015. Axes: Number of sources (10^0 to 10^6) and Update rate (year, month, week, day, hour, minute, second). Plotted: DBpedia, BTC, Dyldo, Internet of Things, Virtual/Augmented Reality, LOD-a-lot. Annotation: "versions?"]
51. 2) RDF evolution at Scale
One of the fundamental problems in the Web of Data
Last few years:
Research projects: Managing the Evolution and Preservation of the Data Web (FP7), Preserving Linked Data (FP7)
Archives
Tools
Benchmarking: BEnchmark of RDF ARchives (BEAR)
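One basic operation behind RDF archiving is computing the change set between two versions of a dataset. A minimal sketch, assuming each version is fully materialised as a set of triples; the function names are hypothetical and this is not the method of any project or benchmark named on this slide.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]


def delta(old: Set[Triple], new: Set[Triple]) -> Tuple[Set[Triple], Set[Triple]]:
    """Return (added, deleted) triples when moving from old to new."""
    return new - old, old - new


def apply_delta(version: Set[Triple], added: Set[Triple],
                deleted: Set[Triple]) -> Set[Triple]:
    """Rebuild the next version from a stored version plus its change set."""
    return (version - deleted) | added
```

Storing only the change sets trades space for the time needed to rebuild a given version, which is one of the trade-offs an archive has to balance.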
52. 3) Ontology-based Data Management
Use case: DBpedia & SPARQL Update to maintain Wikipedia?
Use mappings to update infoboxes and track pages that need updating.
53. Our approach to OBDM over curated sources
1. Ensure consistency in all cases, automatically resolve
updates on a best-effort basis.
2. Learn from existing data and from principled belief
revision semantics.
E.g.: many football players with only one foaf:name in
English DBpedia have both name and full name Infobox
properties set.
3. Record, extract and apply best / typical practices.
Mapping example: the infobox properties name and full_name both map to foaf:name.
A minimal-change insert translation would only update one infobox property.
ESWC’17, TUE 12:00-12:30 - Updating Wikipedia via DBpedia Mappings and
SPARQL. Albin Ahmeti, Javier D. Fernández, Axel Polleres and Vadim Savenkov
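The contrast between a minimal-change translation and one that follows typical practice can be sketched as follows. Everything here is an assumption for illustration: the `MAPPINGS` table is derived only from the name/full_name example, the infobox is modelled as a plain dict, and the function names are hypothetical. It is not the authors' implementation.

```python
from typing import Dict, List

# Assumption: mapping from an RDF property to the infobox properties
# that can produce it, following the name/full_name example.
MAPPINGS: Dict[str, List[str]] = {"foaf:name": ["name", "full_name"]}


def minimal_change_insert(infobox: Dict[str, str], prop: str,
                          value: str) -> Dict[str, str]:
    """Set only the first mapped infobox property that is still empty."""
    updated = dict(infobox)  # work on a copy; the input stays untouched
    for key in MAPPINGS[prop]:
        if key not in updated:
            updated[key] = value
            break
    return updated


def typical_practice_insert(infobox: Dict[str, str], prop: str,
                            value: str) -> Dict[str, str]:
    """Set every mapped infobox property that is still empty, mirroring the
    observed pattern where both name and full_name are filled."""
    updated = dict(infobox)
    for key in MAPPINGS[prop]:
        updated.setdefault(key, value)
    return updated
```

For an empty infobox and an insert of a foaf:name, the first function fills only `name`, while the second fills both `name` and `full_name`.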
3) Ontology-based Data Management
55. Dept. of Information Systems & Operations
Institute for Information Business
Welthandelsplatz 1, 1020 Vienna, Austria
Dr. Javier D. Fernández
T +43-1-313 36-5241
F +43-1-313 36-739
jfernand@wu.ac.at
www.ai.wu.ac.at
Thanks!
Big (Semantic) Data
Versions
Evolving Data
Encryption
Compression
rdfhdt.org
Editor's Notes
After some years of pushing for the Web of Data, now is the moment to look at the ecosystem and think about what we have done so far and what we haven't.
Outlines quite clearly what they thought back then the Semantic Web should be…