Scaling the (evolving) web data
–at low cost-
Javier D. Fernández
QuWeDa 2017: Querying the Web of Data
Kosice, 29/05/2017
A good recipe for a WS Keynote
Ingredients (for approx. 20 persons)
• A motivated speaker
• Some knowledge in the area
• An engaged audience
• Slides (number at your convenience)
Method
• Present yourself
• Set the context, give an overall picture of the area
• Touch some of the topics of the event
• Focus the discussion: sell your work
• Devise future developments in the area
• Mix everything with jokes
About me:
• Since 2015 @WU, Inst. for Information Business
• Research interests: Semantic Web, Open Data, Big (Semantic) Data Management, Databases, Data Compression, Privacy and Security
• https://www.wu.ac.at/en/infobiz/team/fernandez/
[Map: Madrid (Óscar Corcho), Valladolid (Pablo de la Fuente, Miguel A. Martínez-Prieto), Santiago (Claudio Gutiérrez), Rome (Maurizio Lenzerini), Vienna (Axel Polleres)]
The Web of Data Ecosystem
• First, we had better know what we can offer…
  • What is the Semantic Web/Web of Data/Linked Data?
  • Who are we? What have we done so far?
  • What haven't we done so far?
[Figure labels: Linked Data, Semantic Web, Open Data, Big Data]
Big Semantic Data: Linked Data vs. Big Data
• Overlaps:
  • LD as a whole is big (38B-150B triples)
  • No rigid (e.g., relational) data model
  • Big Data technologies (e.g., Hadoop) are used to handle LD
  • LD can represent knowledge extracted from big unstructured data (especially to deal with variety)
• Key Differences:
  • Individual linked data sets are typically not "big" per se (e.g., English DBpedia dump (zip) currently < 5 GB)
  • LD is structured, with a single data model (RDF); "big data lakes" are typically neither
  • Big data is based on distributed data infrastructures within an organization (e.g., Hadoop clusters); LD creates a decentralized, globally distributed data infrastructure
Let’s study the community…
Survey practitioner needs, technological challenges, and open research questions on the use of Linked Data
• Austrian FFG ICT of the Future project (exploratory study)
• Consortium: IDC Austria, Technical University of Vienna, Vienna University of Economics and Business, Semantic Web Company
• Project ended in Dec 2016: https://www.linked-data.at/
[Figure labels: Requirements, Standards*, Literature research*]
* Special kudos to Sabrina Kirrane and Axel Polleres for the community analysis
Interviews
• 23 interviews:
  • Domains: Consulting, Engineering, Environment, Finance and Insurance, Government, Healthcare, ICT, IT, Media, Pharmaceutical, Professional Services, Real Estate, Research, Startup, Tourism, Transports & Logistics
  • Roles: Business Intelligence, CEO, Chief Engineer, Data and Systems Architect, Data Scientist, Director Information Management, Enterprise Architect, Founder, General Secretary, Governance, Risk & Compliance Manager, Head of Communications and Media, Head of Development, Head of HR, Head of R&D, Innovation Manager, Information Architect, IT Project Manager, Management, Managing Director, Marketing Analyst, Principal System Analyst, Project Coordinator, Researcher, Technical Specialist
Technologies in need…
• Analytics
• Computational linguistics & NLP
• Concept tagging & annotation
• Data integration
• Data management
• Dynamic data / streaming
• Extraction, data mining, text mining, entity extraction
• Logic, formal languages & reasoning
• Human-Computer Interaction & visualization
• Knowledge representation
• Machine learning
• Ontology/thesaurus/taxonomy management
• Quality & Provenance
• Recommendation
• Robustness, scalability, optimization and performance
• Searching, browsing & exploration
• Security and privacy
• System engineering
We ended with most areas of the SW
Standards
Standards Toolbox (incl. W3C member submissions)
What can we offer?
Community Analysis
• Monitoring SW community major venues (2006-2015):
  • ISWC (since 2006), ESWC (since 2006), SEMANTiCS (since 2007), JWS (since 2006), SWJ (since 2010)
• 3 seminal papers:
[Timeline figure: 2001 to 2016]
Topic Categorisation
Interestingly, the same “empty” topics in standards
Semantic Web/Linked Data over time…
Subtopics:
• Expressing Meaning
• Knowledge Representation
• Ontologies
• Agents
• Knowledge Representation & Reasoning
Semantic Web/Linked Data over time…
Early adopters:
• MITRE
• Chevron
• British Telecom
• Boeing
• Ordnance Survey
• Eli Lilly
• Pfizer
• Agfa
• Food and Drug Administration
• National Institutes of Health
Software adopters/products:
• Oracle
• Adobe
• Altova
• OpenLink
• TopQuadrant
• Software AG
• Aduna Software
• Protégé
• SAPHIRE
LD Adopters - Companies
[Bar chart: Conference sponsors that appear in papers 2006-2015. x-axis: Companies (Google, Oracle, Yahoo, SAP, IEEE Intelligent Systems, Franz, Bing, Expert System, IBM Research, Poolparty); y-axis: Occurrences (0 to 1600)]
To whom we can sell our technology
Semantic Web/Linked Data over time…
The authors claim that “early research has transitioned into these larger, more applied systems, today’s Semantic Web research is changing: It builds on the earlier foundations but it has generated a more diverse set of pursuits”.
Big Semantic Data and applied systems
Other topics of the QuWeDa workshop
Motivation
• Publication, Exchange and Consumption of large RDF datasets
  • Most RDF formats (N3, XML, Turtle) are text serializations, designed for human readability (not for machines)
  • Verbose = High costs to write/exchange/parse
  • A basic offline search = (decompress) + index the file + search
• Lightweight Binary RDF (HDT)
  • Highly compact serialization of RDF
  • Allows fast RDF retrieval in compressed space (without prior decompression)
  • Includes internal indexes to solve basic queries with a small (3%) memory footprint
  • Very fast on basic queries (triple patterns), 1.5x faster than Virtuoso and RDF-3X
  • Complex queries (joins) on the same scale as current solutions (Virtuoso, RDF-3X)
DBpedia (431 M triples ≈ 63 GB):
• NT + gzip: 5 GB
• HDT: 6.6 GB
• HDT + gzip: 2.7 GB
rdfhdt.org
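The size figures above can be turned into compression ratios and bytes per triple with a few lines of arithmetic. The sketch below is only illustrative: it assumes 1 GB = 10^9 bytes and that the 63 GB figure refers to the uncompressed dump, neither of which the slide states. The names `SIZES_GB`, `ratio` and `bytes_per_triple` are made up for this example.

```python
# Sizes in GB as reported for DBpedia (431 M triples).
# Assumption: "63 GB" is the uncompressed dump and 1 GB = 10**9 bytes.
SIZES_GB = {
    "uncompressed": 63.0,
    "NT + gzip": 5.0,
    "HDT": 6.6,
    "HDT + gzip": 2.7,
}
TRIPLES = 431_000_000


def ratio(fmt: str, baseline: str = "uncompressed") -> float:
    """Size of `fmt` as a fraction of the baseline size."""
    return SIZES_GB[fmt] / SIZES_GB[baseline]


def bytes_per_triple(fmt: str) -> float:
    """Average number of bytes spent per triple in `fmt`."""
    return SIZES_GB[fmt] * 10**9 / TRIPLES


if __name__ == "__main__":
    for fmt in SIZES_GB:
        print(f"{fmt:>13}: {ratio(fmt):6.1%} of baseline, "
              f"{bytes_per_triple(fmt):6.1f} bytes/triple")
```

Under these assumptions HDT alone takes roughly a tenth of the uncompressed size while remaining queryable, and gzip on top of HDT is smaller than gzip on N-Triples.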
The real motivation
[Image: http://www.kunsan.af.mil/News/Article/413995/serving-the-masses/]
“Oh man I’m hungry and I don’t even know if I will like whatever you are cooking”
consume
Applications
• Compress and share ready-to-consume RDF datasets
• Transfer large data between servers
• Embedded Systems & Phones
• Fast, low-cost SPARQL Query Engine
  • Via LDF
  • HDT-Jena
  • HDT-ClioPatria
But what about Web-scale queries
• E.g. retrieve all entities in LOD with the label "Tim Berners-Lee"
• Options:
  • Crawl and index LOD locally (-no-)
  • Follow-your-nose (where should I start?)
  • Federated querying (as good as the endpoints you query)
  • Use LOD Laundromat as a “good approximation” (still querying 650K datasets)
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?x WHERE {
  ?x rdfs:label "Tim Berners-Lee"
}
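The query above is a single triple pattern: the subject is a variable, while the predicate and object are fixed. A minimal in-memory sketch of how such a pattern can be evaluated is shown below. The `match` helper, the toy triples and the example.org URIs are hypothetical, and a linear scan stands in for the internal indexes an engine such as HDT would use.

```python
from typing import Iterable, Iterator, Optional, Tuple

Triple = Tuple[str, str, str]
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def match(triples: Iterable[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> Iterator[Triple]:
    """Yield triples matching the pattern; None acts as a variable."""
    for t in triples:
        if ((s is None or t[0] == s)
                and (p is None or t[1] == p)
                and (o is None or t[2] == o)):
            yield t


def subjects_with_label(triples: Iterable[Triple], label: str) -> list:
    """Rough equivalent of SELECT DISTINCT ?x { ?x rdfs:label <label> }."""
    # Literals are stored in N-Triples form, so a plain literal is quoted.
    literal = f'"{label}"'
    return sorted({t[0] for t in match(triples, p=RDFS_LABEL, o=literal)})


# Toy data; the example.org URIs are made up for illustration.
TOY = [
    ("http://example.org/a/TimBL", RDFS_LABEL, '"Tim Berners-Lee"'),
    ("http://example.org/b/person42", RDFS_LABEL, '"Tim Berners-Lee"'),
    ("http://example.org/b/person42", RDFS_LABEL, '"Tim Berners-Lee"'),
    ("http://example.org/c/TBL", RDFS_LABEL, '"Tim Berners-Lee"@en'),
    ("http://example.org/a/TimBL", "http://example.org/knows",
     "http://example.org/b/person42"),
]

if __name__ == "__main__":
    print(subjects_with_label(TOY, "Tim Berners-Lee"))
```

The language-tagged literal `"Tim Berners-Lee"@en` is not returned, which mirrors how the query as written compares RDF terms: a plain literal and a language-tagged literal are different terms.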
[Diagram: LOD Laundromat: Linked Open Data; Dataset 1 … Dataset 650K, each as N-Triples (zip); SPARQL endpoint (metadata)]
LOD-a-lot (flashforward)
But one could be really hungry
[Image: https://hwy55burgers.wordpress.com/tag/food-challenge/]
LOD-a-lot
[Diagram: Linked Open Data; LOD Laundromat; Dataset 1 … Dataset 650K, each as N-Triples (zip); LOD-a-lot; SPARQL endpoint (metadata)]
28B triples
Kudos: Javier D. Fernández, Wouter Beek, Miguel A. Martínez-Prieto, and Mario Arias
LOD-a-lot (some numbers)
Disk size:
• HDT: 304 GB
• HDT-FoQ (additional indexes): 133 GB
Memory footprint (to query):
• 15.7 GB of RAM (3% of the size)
• 144 seconds loading time
• 8 cores (2.6 GHz), RAM 32 GB, SATA HDD on Ubuntu 14.04.5 LTS
LDF page resolution in milliseconds.
305€
(LOD-a-lot creation took 64 h & 170GB RAM. HDT-FoQ took 8 h & 250GB RAM)
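As a rough sanity check on the numbers above, the following sketch derives bytes per triple and the RAM-to-disk ratio. It assumes 1 GB = 10^9 bytes and exactly 28 × 10^9 triples. The slide does not say which size the "3%" is measured against, so both candidates are computed. All names are illustrative.

```python
# Figures reported for LOD-a-lot. Assumption: 1 GB = 10**9 bytes,
# and "28B triples" is taken as exactly 28 * 10**9.
TRIPLES = 28 * 10**9
HDT_GB = 304.0        # HDT file
INDEX_GB = 133.0      # HDT-FoQ additional indexes
RAM_GB = 15.7         # memory footprint when querying


def bytes_per_triple(size_gb: float, triples: int = TRIPLES) -> float:
    """Average bytes on disk per triple for a file of `size_gb`."""
    return size_gb * 10**9 / triples


def fraction(part_gb: float, whole_gb: float) -> float:
    """`part_gb` as a fraction of `whole_gb`."""
    return part_gb / whole_gb


if __name__ == "__main__":
    total_gb = HDT_GB + INDEX_GB
    print(f"HDT only:      {bytes_per_triple(HDT_GB):.1f} bytes/triple")
    print(f"HDT + indexes: {bytes_per_triple(total_gb):.1f} bytes/triple")
    print(f"RAM vs HDT:           {fraction(RAM_GB, HDT_GB):.1%}")
    print(f"RAM vs HDT + indexes: {fraction(RAM_GB, total_gb):.1%}")
```

Against the combined 437 GB the footprint comes out near 3.6%, which is in the same range as the 3% quoted on the slide.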
LOD-a-lot
https://datahub.io/dataset/lod-a-lot
LOD-a-lot (some use cases)
• Query resolution at Web scale
• Evaluation and Benchmarking
  • No excuse :)
• RDF metrics and analytics
[Figure labels: subjects, predicates, objects]
LOD-a-lot (ACKs)
1) Linked Open/Closed Data
[Diagram: Linked Open Data Cloud (including DBpedia) and Linked Closed Data Cloud, with graphs G1a, G2a, G3a, G4a, G1b, G2b, G3b, G1c, G2c]
“Deep Semantic Web”
• A) Exchange: Encryption + HDT (hdtcrypt)
• B) A secure LD Endpoint
ESWC’17, THU 16:30-17:00
Self-Enforcing Access Control for Encrypted RDF
Javier D. Fernández, Sabrina Kirrane, Axel Polleres and
Simon Steyskal
2) RDF evolution at Scale
[Chart from Andreas Harth, "Stream Reasoning in Mixed Reality Applications", Stream Reasoning Workshop 2015. x-axis: number of sources (10^0 to 10^6); y-axis: update rate (year, month, week, day, hour, minute, second); plotted: DBpedia, BTC, Dyldo, Internet of Things, Virtual/Augmented Reality, LOD-a-lot; annotation: versions?]
Last few years:
• Research projects: Managing the Evolution and Preservation of the Data Web (FP7); Preserving Linked Data (FP7)
• Archives
• Tools
• Benchmarking: BEnchmark of RDF ARchives
“one of the fundamental problems in the Web of Data”
3) Ontology-based Data Management
Use case: DBpedia & SPARQL Update to maintain Wikipedia?
Use mappings to update infoboxes and track pages that need updating.
Our approach to OBDM over curated sources
1. Ensure consistency in all cases, automatically resolve updates on a best-effort basis.
2. Learn from existing data and from principled belief revision semantics.
   • E.g.: many football players with only one foaf:name in English DBpedia have both name and full name Infobox properties set.
3. Record, extract and apply best / typical practices.
[Diagram labels: infobox properties name and full_name; DBpedia property foaf:name]
A minimal-change insert translation would only update one infobox property.
ESWC’17, TUE 12:00-12:30: Updating Wikipedia via DBpedia Mappings and SPARQL. Albin Ahmeti, Javier D Fernández, Axel Polleres and Vadim Savenkov
Dept. of Information Systems & Operations
Institute for Information Business
Welthandelsplatz 1, 1020 Vienna, Austria
Dr. Javier D. Fernández
T +43-1-313 36-5241
F +43-1-313 36-739
jfernand@wu.ac.at
www.ai.wu.ac.at
Thanks!
• Big (Semantic) Data
• Versions
• Evolving Data
• Encryption
• Compression
rdfhdt.org

More Related Content

What's hot

Another RDF Encoding Form
Another RDF Encoding FormAnother RDF Encoding Form
Another RDF Encoding FormJakob .
 
Introduction To RDF and RDFS
Introduction To RDF and RDFSIntroduction To RDF and RDFS
Introduction To RDF and RDFSNilesh Wagmare
 
Virtuoso -- The Prometheus of RDF
Virtuoso -- The Prometheus of RDFVirtuoso -- The Prometheus of RDF
Virtuoso -- The Prometheus of RDFOpenLink Software
 
20160818 Semantics and Linkage of Archived Catalogs
20160818 Semantics and Linkage of Archived Catalogs20160818 Semantics and Linkage of Archived Catalogs
20160818 Semantics and Linkage of Archived Catalogsandrea huang
 
The web of interlinked data and knowledge stripped
The web of interlinked data and knowledge strippedThe web of interlinked data and knowledge stripped
The web of interlinked data and knowledge strippedSören Auer
 
Applications of Word Vectors in Text Retrieval and Classification
Applications of Word Vectors in Text Retrieval and ClassificationApplications of Word Vectors in Text Retrieval and Classification
Applications of Word Vectors in Text Retrieval and Classificationshakimov
 
Verifying Integrity Constraints of a RDF-based WordNet
Verifying Integrity Constraints of a RDF-based WordNetVerifying Integrity Constraints of a RDF-based WordNet
Verifying Integrity Constraints of a RDF-based WordNetAlexandre Rademaker
 
Wi2015 - Clustering of Linked Open Data - the LODeX tool
Wi2015 - Clustering of Linked Open Data - the LODeX toolWi2015 - Clustering of Linked Open Data - the LODeX tool
Wi2015 - Clustering of Linked Open Data - the LODeX toolLaura Po
 
Linked Open Data Visualization
Linked Open Data VisualizationLinked Open Data Visualization
Linked Open Data VisualizationLaura Po
 
DLF 2015 Presentation, "RDF in the Real World."
DLF 2015 Presentation, "RDF in the Real World." DLF 2015 Presentation, "RDF in the Real World."
DLF 2015 Presentation, "RDF in the Real World." Avalon Media System
 
morph-LDP: An R2RML-based Linked Data Platform implementation
morph-LDP: An R2RML-based Linked Data Platform implementationmorph-LDP: An R2RML-based Linked Data Platform implementation
morph-LDP: An R2RML-based Linked Data Platform implementationNandana Mihindukulasooriya
 

What's hot (16)

NISO/DCMI Webinar: International Bibliographic Standards, Linked Data, and th...
NISO/DCMI Webinar: International Bibliographic Standards, Linked Data, and th...NISO/DCMI Webinar: International Bibliographic Standards, Linked Data, and th...
NISO/DCMI Webinar: International Bibliographic Standards, Linked Data, and th...
 
Another RDF Encoding Form
Another RDF Encoding FormAnother RDF Encoding Form
Another RDF Encoding Form
 
5 rdfs
5 rdfs5 rdfs
5 rdfs
 
Introduction To RDF and RDFS
Introduction To RDF and RDFSIntroduction To RDF and RDFS
Introduction To RDF and RDFS
 
Virtuoso -- The Prometheus of RDF
Virtuoso -- The Prometheus of RDFVirtuoso -- The Prometheus of RDF
Virtuoso -- The Prometheus of RDF
 
20160818 Semantics and Linkage of Archived Catalogs
20160818 Semantics and Linkage of Archived Catalogs20160818 Semantics and Linkage of Archived Catalogs
20160818 Semantics and Linkage of Archived Catalogs
 
The web of interlinked data and knowledge stripped
The web of interlinked data and knowledge strippedThe web of interlinked data and knowledge stripped
The web of interlinked data and knowledge stripped
 
Applications of Word Vectors in Text Retrieval and Classification
Applications of Word Vectors in Text Retrieval and ClassificationApplications of Word Vectors in Text Retrieval and Classification
Applications of Word Vectors in Text Retrieval and Classification
 
Verifying Integrity Constraints of a RDF-based WordNet
Verifying Integrity Constraints of a RDF-based WordNetVerifying Integrity Constraints of a RDF-based WordNet
Verifying Integrity Constraints of a RDF-based WordNet
 
Wi2015 - Clustering of Linked Open Data - the LODeX tool
Wi2015 - Clustering of Linked Open Data - the LODeX toolWi2015 - Clustering of Linked Open Data - the LODeX tool
Wi2015 - Clustering of Linked Open Data - the LODeX tool
 
Linked Open Data Visualization
Linked Open Data VisualizationLinked Open Data Visualization
Linked Open Data Visualization
 
DLF 2015 Presentation, "RDF in the Real World."
DLF 2015 Presentation, "RDF in the Real World." DLF 2015 Presentation, "RDF in the Real World."
DLF 2015 Presentation, "RDF in the Real World."
 
morph-LDP: An R2RML-based Linked Data Platform implementation
morph-LDP: An R2RML-based Linked Data Platform implementationmorph-LDP: An R2RML-based Linked Data Platform implementation
morph-LDP: An R2RML-based Linked Data Platform implementation
 
Democratizing Big Semantic Data management
Democratizing Big Semantic Data managementDemocratizing Big Semantic Data management
Democratizing Big Semantic Data management
 
Fedora Migration Considerations
Fedora Migration ConsiderationsFedora Migration Considerations
Fedora Migration Considerations
 
LD4KD 2015 - Demos and tools
LD4KD 2015 - Demos and toolsLD4KD 2015 - Demos and tools
LD4KD 2015 - Demos and tools
 

Similar to Scaling the (evolving) web data –at low cost-

Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridEvert Lammerts
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and HadoopFlavio Vit
 
(Big) Data (Science) Skills
(Big) Data (Science) Skills(Big) Data (Science) Skills
(Big) Data (Science) SkillsOscar Corcho
 
Empowering Transformational Science
Empowering Transformational ScienceEmpowering Transformational Science
Empowering Transformational ScienceChelle Gentemann
 
State of the Semantic Web
State of the Semantic WebState of the Semantic Web
State of the Semantic WebIvan Herman
 
Tds — big science dec 2021
Tds — big science dec 2021Tds — big science dec 2021
Tds — big science dec 2021Gérard Dupont
 
Intro to Digitization Projects
Intro to Digitization ProjectsIntro to Digitization Projects
Intro to Digitization Projectszsrlibrary
 
Presentation on BigData by Swapnaja
Presentation on BigData by Swapnaja Presentation on BigData by Swapnaja
Presentation on BigData by Swapnaja Swapnaja Tandale
 
Debunking "Purpose-Built Data Systems:": Enter the Universal Database
Debunking "Purpose-Built Data Systems:": Enter the Universal DatabaseDebunking "Purpose-Built Data Systems:": Enter the Universal Database
Debunking "Purpose-Built Data Systems:": Enter the Universal DatabaseStavros Papadopoulos
 
Data FAIRport Skunkworks: Common Repository Access Via Meta-Metadata Descript...
Data FAIRport Skunkworks: Common Repository Access Via Meta-Metadata Descript...Data FAIRport Skunkworks: Common Repository Access Via Meta-Metadata Descript...
Data FAIRport Skunkworks: Common Repository Access Via Meta-Metadata Descript...datascienceiqss
 
Bigdata and Hadoop Bootcamp
Bigdata and Hadoop BootcampBigdata and Hadoop Bootcamp
Bigdata and Hadoop BootcampSpotle.ai
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Jim Dowling
 
Linked Data for Architecture, Engineering and Construction (AEC)
Linked Data for Architecture, Engineering and Construction (AEC)Linked Data for Architecture, Engineering and Construction (AEC)
Linked Data for Architecture, Engineering and Construction (AEC)Stefan Dietze
 
Big Tools for Big Data
Big Tools for Big DataBig Tools for Big Data
Big Tools for Big DataLewis Crawford
 

Similar to Scaling the (evolving) web data –at low cost- (20)

Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG Grid
 
The future of Big Data tooling
The future of Big Data toolingThe future of Big Data tooling
The future of Big Data tooling
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
(Big) Data (Science) Skills
(Big) Data (Science) Skills(Big) Data (Science) Skills
(Big) Data (Science) Skills
 
Empowering Transformational Science
Empowering Transformational ScienceEmpowering Transformational Science
Empowering Transformational Science
 
State of the Semantic Web
State of the Semantic WebState of the Semantic Web
State of the Semantic Web
 
Tds — big science dec 2021
Tds — big science dec 2021Tds — big science dec 2021
Tds — big science dec 2021
 
Hadoop
HadoopHadoop
Hadoop
 
Semantic Web in Action
Semantic Web in ActionSemantic Web in Action
Semantic Web in Action
 
Binary RDF for Scalable Publishing, Exchanging and Consumption in the Web of ...
Binary RDF for Scalable Publishing, Exchanging and Consumption in the Web of ...Binary RDF for Scalable Publishing, Exchanging and Consumption in the Web of ...
Binary RDF for Scalable Publishing, Exchanging and Consumption in the Web of ...
 
Intro to Digitization Projects
Intro to Digitization ProjectsIntro to Digitization Projects
Intro to Digitization Projects
 
Presentation on BigData by Swapnaja
Presentation on BigData by Swapnaja Presentation on BigData by Swapnaja
Presentation on BigData by Swapnaja
 
Debunking "Purpose-Built Data Systems:": Enter the Universal Database
Debunking "Purpose-Built Data Systems:": Enter the Universal DatabaseDebunking "Purpose-Built Data Systems:": Enter the Universal Database
Debunking "Purpose-Built Data Systems:": Enter the Universal Database
 
Data FAIRport Skunkworks: Common Repository Access Via Meta-Metadata Descript...
Data FAIRport Skunkworks: Common Repository Access Via Meta-Metadata Descript...Data FAIRport Skunkworks: Common Repository Access Via Meta-Metadata Descript...
Data FAIRport Skunkworks: Common Repository Access Via Meta-Metadata Descript...
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Bigdata and Hadoop Bootcamp
Bigdata and Hadoop BootcampBigdata and Hadoop Bootcamp
Bigdata and Hadoop Bootcamp
 
The Future of LOD
The Future of LODThe Future of LOD
The Future of LOD
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019
 
Linked Data for Architecture, Engineering and Construction (AEC)
Linked Data for Architecture, Engineering and Construction (AEC)Linked Data for Architecture, Engineering and Construction (AEC)
Linked Data for Architecture, Engineering and Construction (AEC)
 
Big Tools for Big Data
Big Tools for Big DataBig Tools for Big Data
Big Tools for Big Data
 

Recently uploaded

"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 

Recently uploaded (20)

"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 

Scaling the (evolving) web data –at low cost-

  • 1. Scaling the (evolving) web data –at low cost- Javier D. Fernández QuWeDa 2017: Querying the Web of Data Kosice, 29/05/2017
  • 2. A good recipe for a WS Keynote Ingredients (for approx. 20 persons)  A motivated speaker  Some knowledge in the area  An engaged audience  Slides (number at your convenience) Method  Present yourself  Set the context, give an overall picture of the area  Touch some of the topics of the event  Focus the discussion- Sell your work  Devise future developments in the area • Mix everything with jokes
  • 3. About me: since 2015 @WU, Inst. for Information Business. Research interest: Semantic Web, Open Data, Big (Semantic) Data Management, Databases, Data Compression, Privacy and Security. https://www.wu.ac.at/en/infobiz/team/fernandez/ Madrid, Valladolid, Santiago, Rome, Vienna (Óscar Corcho, Pablo de la Fuente, Miguel A. Martínez-Prieto, Claudio Gutiérrez, Maurizio Lenzerini, Axel Polleres)
  • 4. A good recipe for a WS Keynote. Ingredients (for approx. 20 persons): a motivated speaker; some knowledge in the area; an engaged audience; slides (number at your convenience). Method: present yourself; set the context, give an overall picture of the area; touch some of the topics of the event; focus the discussion, sell your work; devise future developments in the area; mix everything with humour.
  • 5. The Web of Data Eco System
  • 6. The Web of Data Eco System. First, we had better know what we can offer… What is the Semantic Web/Web of Data/Linked Data? Who are we? What have we done so far? What haven't we done so far? [Diagram labels: Linked Data, Semantic Web, Open Data, Big Data]
  • 7. (Big Semantic Data: Linked Data vs. Big Data) Overlaps: LD as a whole is big (38B-150B triples); no rigid (e.g., relational) data model; Big Data technologies (e.g., Hadoop) are used to handle LD; LD can represent knowledge extracted from big unstructured data (especially to deal with variety). Key differences: individual linked data sets are typically not "big" per se (e.g., the English DBpedia dump (zip) is currently < 5 GB); LD is structured and has a single data model (RDF), whereas "big data lakes" are typically neither; Big Data is based on distributed data infrastructures within an organization (e.g., Hadoop clusters), whereas LD creates a decentralized, globally distributed data infrastructure.
  • 8. Let’s study the community… Survey of practitioner needs, technological challenges, and open research questions on the use of Linked Data: Austrian FFG ICT of the Future project (exploratory study); consortium: IDC Austria, Technical University of Vienna, Vienna University of Economics and Business, Semantic Web Company; project ended in Dec 2016: https://www.linked-data.at/ [Diagram labels: Standards*, Requirements, Literature research*] * Special kudos to Sabrina Kirrane and Axel Polleres for the community analysis
  • 9. Interviews: 23 interviews. Domains: Consulting, Engineering, Environment, Finance and Insurance, Government, Healthcare, ICT, IT, Media, Pharmaceutical, Professional Services, Real Estate, Research, Startup, Tourism, Transport & Logistics. Roles: Business Intelligence, CEO, Chief Engineer, Data and Systems Architect, Data Scientist, Director Information Management, Enterprise Architect, Founder, General Secretary, Governance, Risk & Compliance Manager, Head of Communications and Media, Head of Development, Head of HR, Head of R&D, Innovation Manager, Information Architect, IT Project Manager, Management, Managing Director, Marketing Analyst, Principal System Analyst, Project Coordinator, Researcher, Technical Specialist
  • 10. Technologies in need… Analytics; Computational linguistics & NLP; Concept tagging & annotation; Data integration; Data management; Dynamic data / streaming; Extraction, data mining, text mining, entity extraction; Logic, formal languages & reasoning; Human-Computer Interaction & visualization; Knowledge representation; Machine learning; Ontology/thesaurus/taxonomy management; Quality & Provenance; Recommendation; Robustness, scalability, optimization and performance; Searching, browsing & exploration; Security and privacy; System engineering. We ended up with most areas of the SW.
  • 12. Standards Toolbox (incl. W3C member submissions)
  • 16. What can we offer? Community Analysis. Monitoring SW community major venues (2006-2015): ISWC (since 2006), ESWC (since 2006), SEMANTiCS (since 2007), JWS (since 2006), SWJ (since 2010). 3 seminal papers [timeline figure, 2001-2016]
  • 18. Topic Categorisation. Interestingly, the same topics are “empty” in the standards.
  • 19. Semantic Web/Linked Data over time… Subtopics: Expressing Meaning, Knowledge Representation, Ontologies, Agents
  • 21. Semantic Web/Linked Data over time… Early adopters: MITRE, Chevron, British Telecom, Boeing, Ordnance Survey, Eli Lilly, Pfizer, Agfa, Food and Drug Administration, National Institutes of Health. Software adopters/products: Oracle, Adobe, Altova, OpenLink, TopQuadrant, Software AG, Aduna Software, Protégé, SAPHIRE
  • 22. LD Adopters - Companies
  • 23. LD Adopters - Companies
  • 24. LD Adopters - Companies [bar chart: conference sponsors that appear in papers, 2006-2015; x-axis: occurrences (0-1600); y-axis: companies (Google, Oracle, Yahoo, SAP, IEEE Intelligent Systems, Franz, Bing, Expert System, IBM Research, Poolparty)]
  • 25. To whom we can sell our technology
  • 26. Semantic Web/Linked Data over time… The authors claim that “early research has transitioned into these larger, more applied systems, today’s Semantic Web research is changing: It builds on the earlier foundations but it has generated a more diverse set of pursuits”.
  • 27. Big Semantic Data and applied systems
  • 28. Big Semantic Data and applied systems
  • 29. Other topics of the QuWeDa workshop
  • 30. A good recipe for a WS Keynote (repeat of slide 4)
  • 31. Motivation. Publication, exchange and consumption of large RDF datasets: most RDF formats (N3, XML, Turtle) are text serializations, designed for human readability (not for machines); verbose = high costs to write/exchange/parse; a basic offline search = (decompress) + index the file + search. Lightweight binary RDF (HDT): highly compact serialization of RDF; allows fast RDF retrieval in compressed space (without prior decompression); includes internal indexes to solve basic queries with a small (3%) memory footprint; very fast on basic queries (triple patterns), 1.5x faster than Virtuoso and RDF-3X; complex queries (joins) on the same scale as current solutions (Virtuoso, RDF-3X). DBpedia (431 M triples, ~63 GB): NT + gzip 5 GB; HDT 6.6 GB; HDT + gzip 2.7 GB. rdfhdt.org
  • 33. The real motivation: “Oh man, I’m hungry and I don’t even know if I will like whatever you are cooking” (image: http://www.kunsan.af.mil/News/Article/413995/serving-the-masses/)
  • 34. The real motivation: “Oh man, I’m hungry and I don’t even know if I will like whatever you are cooking” [label: consume] (image: http://www.kunsan.af.mil/News/Article/413995/serving-the-masses/)
  • 35. Applications: compress and share ready-to-consume RDF datasets; transfer large data between servers; embedded systems & phones; fast, low-cost SPARQL query engine; via LDF; HDT-Jena; HDT-ClioPatria
  • 36. But what about Web-scale queries? E.g., retrieve all entities in LOD with the label “Tim Berners-Lee”. Options: crawl and index LOD locally (no); follow-your-nose (where should I start?); federated querying (as good as the endpoints you query); use LOD Laundromat as a “good approximation” (still querying 650K datasets). Query: select distinct ?x { ?x rdfs:label "Tim Berners-Lee" }
  • 38. But what about Web-scale queries? LOD-a-lot (flashforward)
  • 39. But what about Web-scale queries? But one could be really hungry: LOD-a-lot (image: https://hwy55burgers.wordpress.com/tag/food-challenge/)
  • 40. LOD-a-lot [diagram labels: Linked Open Data; LOD Laundromat; Dataset 1 … Dataset 650K, each as N-Triples (zip); SPARQL endpoint (metadata); LOD-a-lot; 28B triples]. Kudos: Javier D. Fernández, Wouter Beek, Miguel A. Martínez-Prieto, and Mario Arias
  • 41. LOD-a-lot (some numbers). Disk size: HDT: 304 GB; HDT-FoQ (additional indexes): 133 GB. Memory footprint (to query): 15.7 GB of RAM (3% of the size); 144 seconds loading time; 8 cores (2.6 GHz), RAM 32 GB, SATA HDD on Ubuntu 14.04.5 LTS. LDF page resolution in milliseconds. 305€ (LOD-a-lot creation took 64 h & 170 GB RAM. HDT-FoQ took 8 h & 250 GB RAM)
  • 43. LOD-a-lot (some use cases): query resolution at Web scale; evaluation and benchmarking (no excuse :)); RDF metrics and analytics [figure labels: subjects, predicates, objects]
  • 45. A good recipe for a WS Keynote (repeat of slide 4)
  • 46. 1) Linked Open/Closed Data: “Deep Semantic Web” [diagram labels: Linked Open Data Cloud; Linked Closed Data Cloud; dbpedia; graphs G1a, G2a, G3a, G4a, G1b, G2b, G3b, G1c, G2c]
  • 48. 1) Linked Open/Closed Data. A) Exchange: Encryption + HDT (hdtcrypt)
  • 49. 1) Linked Open/Closed Data. B) A secure LD Endpoint. ESWC’17, THU 16:30-17:00: Self-Enforcing Access Control for Encrypted RDF. Javier D. Fernández, Sabrina Kirrane, Axel Polleres and Simon Steyskal
  • 50. 2) RDF evolution at Scale [figure from Andreas Harth, "Stream Reasoning in Mixed Reality Applications", Stream Reasoning Workshop 2015; axes: number of sources (10^0 to 10^6) and update rate (year, month, week, day, hour, minute, second); plotted: DBpedia, BTC, Dyldo, Internet of Things, Virtual/Augmented Reality, LOD-a-lot; annotation: versions?]
  • 51. 2) RDF evolution at Scale: one of the fundamental problems in the Web of Data. Last few years: research projects (Managing the Evolution and Preservation of the Data Web (FP7); Preserving Linked Data (FP7)); archives; tools; benchmarking (BEnchmark of RDF ARchives)
  • 52. 3) Ontology-based Data Management. Use case: DBpedia & SPARQL Update to maintain Wikipedia? Use mappings to update infoboxes and track pages that need updating.
  • 53. 3) Ontology-based Data Management: our approach to OBDM over curated sources. 1. Ensure consistency in all cases, automatically resolve updates on a best-effort basis. 2. Learn from existing data and from principled belief revision semantics. E.g., many football players with only one foaf:name in English DBpedia have both the name and full_name infobox properties set. 3. Record, extract and apply best/typical practices. [Figure labels: name, foaf:name, full_name] A minimal-change insert translation would only update one infobox property. ESWC’17, TUE 12:00-12:30: Updating Wikipedia via DBpedia Mappings and SPARQL. Albin Ahmeti, Javier D. Fernández, Axel Polleres and Vadim Savenkov
  • 54. A good recipe for a WS Keynote (repeat of slide 4)
  • 55. Thanks! Big (Semantic) Data; Versions; Evolving Data; Encryption; Compression. rdfhdt.org. Dr. Javier D. Fernández, Dept. of Information Systems & Operations, Institute for Information Business, Welthandelsplatz 1, 1020 Vienna, Austria. T +43-1-313 36-5241, F +43-1-313 36-739, jfernand@wu.ac.at, www.ai.wu.ac.at
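
Slide 31 describes HDT as a compact binary serialization that can answer triple patterns without prior decompression. The sketch below only illustrates the general idea behind that claim: replace each RDF term by an integer ID from a dictionary, store triples as ID tuples, and match a triple pattern on the IDs. It is a toy in Python, not the HDT format or any rdfhdt.org API; the class name `TinyEncodedStore`, its `match` method and the example.org IRIs are made up for illustration.

```python
from typing import Dict, Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]


class TinyEncodedStore:
    """Toy dictionary-encoded triple store (illustrative only, not HDT)."""

    def __init__(self, triples: List[Triple]) -> None:
        # Dictionary component: each distinct term is stored once and gets an ID.
        terms = sorted({term for triple in triples for term in triple})
        self.term_to_id: Dict[str, int] = {t: i for i, t in enumerate(terms, start=1)}
        self.id_to_term: Dict[int, str] = {i: t for t, i in self.term_to_id.items()}
        # Triples component: sorted ID tuples, so long IRIs are never repeated.
        self.encoded: List[Tuple[int, ...]] = sorted(
            {tuple(self.term_to_id[t] for t in triple) for triple in triples}
        )

    def match(
        self,
        s: Optional[str] = None,
        p: Optional[str] = None,
        o: Optional[str] = None,
    ) -> Iterator[Triple]:
        """Yield triples matching the pattern; None acts as a wildcard."""
        pattern: List[Optional[int]] = []
        for term in (s, p, o):
            if term is None:
                pattern.append(None)
            elif term in self.term_to_id:
                pattern.append(self.term_to_id[term])
            else:
                return  # a term missing from the dictionary cannot match anything
        for ids in self.encoded:
            if all(want is None or want == got for want, got in zip(pattern, ids)):
                s_id, p_id, o_id = ids
                yield (self.id_to_term[s_id], self.id_to_term[p_id], self.id_to_term[o_id])


RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

store = TinyEncodedStore([
    ("http://example.org/timbl", RDFS_LABEL, '"Tim Berners-Lee"'),
    ("http://example.org/timbl", FOAF_NAME, '"Tim Berners-Lee"'),
    ("http://example.org/other", RDFS_LABEL, '"Someone Else"'),
])

# Triple pattern (?x, rdfs:label, "Tim Berners-Lee"), as in the query on slide 36.
matches = list(store.match(p=RDFS_LABEL, o='"Tim Berners-Lee"'))
print(matches)
```

A linear scan is used here for brevity; a real compact store would rely on indexes over the sorted IDs rather than scanning every triple.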

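Slide 36 asks for all entities labelled "Tim Berners-Lee", and slide 35 lists LDF as one way to query HDT. As a rough sketch, the same triple pattern can be expressed as a Triple Pattern Fragments style request URL. The base URL below is hypothetical, and the `subject`/`predicate`/`object` parameter names are an assumption based on common LDF server setups; a real client should read the URL template from the hypermedia controls in the server's response rather than hard-code it.

```python
from typing import Dict, Optional
from urllib.parse import urlencode

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
BASE = "https://fragments.example.org/lod"  # hypothetical fragment endpoint


def triple_pattern_url(
    fragment_base: str,
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> str:
    """Build a request URL for one triple pattern; None means 'any term'."""
    params: Dict[str, str] = {}
    if subject is not None:
        params["subject"] = subject
    if predicate is not None:
        params["predicate"] = predicate
    if obj is not None:
        params["object"] = obj
    query = urlencode(params)  # percent-encodes IRIs and quoted literals
    return f"{fragment_base}?{query}" if query else fragment_base


# Pattern (?x, rdfs:label, "Tim Berners-Lee"); the literal keeps its quotes.
url = triple_pattern_url(BASE, predicate=RDFS_LABEL, obj='"Tim Berners-Lee"')
print(url)
```

Keeping the pattern as plain query parameters means the server only ever evaluates single triple patterns, which is the low-cost interface the slides point to; joins are left to the client.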
Editor's Notes

  1. After some years of pushing for the Web of Data, now is the moment to look at the ecosystem and think about what we have done so far, and what we haven't.
  2. Outlines quite clearly what they thought back then the Semantic Web should be…
  3. LEDS: Linked Enterprise Data Services