SlideShare a Scribd company logo
1 of 57
Cognitive Graph Analytics
on Company Data and News
Webinar, Mar 2018
Popularity ranking, importance and similarity
Presentation Outline
o Ontotext Introduction
o Cognitive Analytics Meet Big Knowledge Graphs
o Big Company Data: Knowing, Matching and Cleaning
o FactForge as a Playground
o RDFRank: Importance of Node in a Knowledge Graph
o Popularity Ranking of Companies
o Node Similarity in Knowledge Graphs
Presentation Outline
Technology Excellence Delivered
o Unique technology mix: GraphDBTM engine + Text mining
o Robust technology: We power BBC.CO.UK/SPORT and FT.COM
o We serve many of the most knowledge intensive enterprises:
FT
Financial
Times
Houses of Parliament
Information & Communications Technology
Linking Text to Knowledge Graphs
1. Integrate relevant data from many sources
 Build a Big Knowledge Graph from proprietary databases and
taxonomies, integrated with millions of facts of Linked Data
2. Infer new facts and unveil relationships
 Performing reasoning across data from different sources
3. Link text reference to a big Knowledge Graphs
 Using text-mining to automatically discover references to concepts and entities
4. Use GraphDB for metadata management, querying and search
Presentation Outline
o Ontotext Introduction
o Cognitive Analytics Meet Big Knowledge Graphs
o Big Company Data: Knowing, Matching and Cleaning
o FactForge as a Playground
o RDFRank: Importance of Node in a Knowledge Graph
o Popularity Ranking of Companies
o Node Similarity in Knowledge Graphs
Presentation Outline
Cognitive Systems: Requirements
Enterprise cognitive systems should:
1. Know the key concepts and entities in the domain
2. Be able to recognize in text mentions of entities
3. Be able to reconciliate entities across data sources and systems
4. Combine global intelligence with proprietary data and perspective
5. Evolve and learn, without cognitive expert engineering
Cognitive Systems: Motivation
Adequate business decisions require global information!
 Analytics cannot deliver deep insights only based on proprietary data
 Broader context and signals are needed
Cognitive Systems: Motivation
Developing from scratch cognitive
system with global knowledge is
unfeasible for most of the enterprises
 We don’t grow and educate people tailor-made to
serve single enterprise, right?
Cognitive Reading Machines
o Knowledge and awareness
 Factual information about concepts, entities, events , …
o Reading capabilities:
 Lexical and grammatical knowledge to comprehend text
 Recognize concepts, entities and events
o Resolving ambiguity: whether “Paris” refers to "Paris, France“, “Paris, TX” or "Paris Hilton"
o Learning capabilities: learn new facts from text
 Read and analyze information in order to extend their knowledge
o Get better entity profiles and relations
 Get better in recognizing concepts and entities
Context and Awareness
o Context allows concepts to be identified, the way people do
o Sufficiently big knowledge graph provides context for the entity
 Differentiating features and similar nodes
 How important and how popular it is
 Related entities and concepts
 Entities it is typically mentioned together with (co-occurrence)
o This is awareness!
o The kind of knowledge that people mean saying "I am aware of X"
or "She is cognizant of Y"
The Critical Mass
Malcolm Gladwell claims that one needs
to devote 10 000 hours to become an
expert in something, e.g. violin or hokey
(Outliers)
The Critical Mass
o A cognitive system needs:
 To know 1B facts
 About 100M concepts and entities
 Read 1M news articles
o In order to reach concept and entity
awareness in a specific domain
 The level of awareness that people mean saying
“My background is X”
Let’s play an Awareness game!
o Important airports near London?
o The most popular banks in US?
o Companies similar to Google?
o People mentioned together with IBM in news?
We are getting closer!
o Our Business Knowledge Model can already answer many of
these questions better than you
o Most of this intelligence is available in the Ontotext Platform
o Knowledge model = KG + text mining + analytics
o We already offer two such knowledge models:
 Business and general news: one for processing general business master data (like people,
organizations, locations and their mentions in the news)
 Life sciences and healthcare
Presentation Outline
o Ontotext Introduction
o Cognitive Analytics Meet Big Knowledge Graphs
o Big Company Data: Knowing, Matching and Cleaning
o FactForge as a Playground
o RDFRank: Importance of Node in a Knowledge Graph
o Popularity Ranking of Companies
o Node Similarity in Knowledge Graphs
Presentation Outline
o POL data is the most common type of master/reference data
 Considering business applications and news
o Open POL data is available in vast quantities
 Geonames covers locations exhaustively; DBPedia covers well popular POL entities; Wikidata, …
 Open company data grows: OpenCorporates, GLEI, open national registers, various “data leaks”
o Within 3 years exhaustive global POL data will be commodity!
 And they will be widely used for BI and decision making
o Ontotext delivers Global POL data solutions today.
 We aim to make them more affordable with more cognitive analytics
Person, Organization, Location (POL) Data
Oct
2016
Company Data Species (1/2)
Category Representatives Size (Orgs.)
Exhaustive Global Databases Dun & Bradstreet, BvD, Factset > 200M
Rich Company Databases Capital IQ (S&P), Thomson Reuters (various) 5-10M
Investment Databases CrunchBase, PitchBook, CBI, DJ Venture Source 200-600K
Very Big Open Databases OpenCorporates 130M
Global Official Open Databases GLEI (Global Legal Identifier), EU BRIS 1-30M
Open Encyclopedic DBPedia, Wikidata 0.3-1.2M
Open Leaks and Investigations Panama Papers (Offshore Leaks), Trump World Data 3-300K
Oct
2016
Company Data Species (2/2)
Category Loca
tions
Industry
Classi-
fication
High
Tech.
Fields
Invest.
Info
Org-Org
Relations
(e.g.Tree)
Org-
Person
Relations
Clean,
Correct,
Predictable
Exhaustive Global Databases ++ +/- - - ++ +/- 6
Rich Company Databases ++ + +/- +/- ++ +/- 8
Investment Databases +/- +/- + + ++/- +/- 4-6
Very Big Open Databases + +/- - - +/- - 8
Global Official Open Databases + - - - +/- - 8
Open Encyclopedic +/- +/- + - +/- + 3-5
Open Leaks and Investigations +/- - - - +/- + 4-6
o Integrate European company and economic data
o euBusinessGraph will overcome barriers in company data provisioning
o Technology and research partners: SINTEF (coord.), Ontotext, IJS, Uni. Milano
Matching and Overlap
o Organizations matched across:
CrunchBase (CB), CB Insights
(CBI), Capital IQ (CIQ), DJ
Venture Source, …
o The Venn diagram presents
the overlap between sources
 The size of the circle indicates
number of entities per source
 The level of overlap indicates number
of entities matched between the two
sources
Oct
2016
Slice and Dice
by Data Source,
Industry and
High-tech Field
Oct
2016
Data Consolidation Across Data Sources (1/2)
Oct
2016
Data Consolidation Across Data Sources (2/2)
Entity Matching Across Datasets
o Match IDs of one of the same real entity across different databases
o Data Challenges
 Different schemata
 Name variations
 Different classifications and codes
 Lack of unique identifiers (even ticker symbols are not unique)
o Technology challenges
 Pre-selection is needed; brute-force matching is not good for 1M against 5M companies
 It is not trivial to come up with good pre-selection mechanism
Company Matching Methodology
o Load all datasets in a single knowledge graph
 Schema mapping: RDF-ize all datasets using an unified ontology
o Map the industry classifications used across the sources
o Map all locations to Geonames
o Pre-selection: use FTS on names and descriptions
 GraphDB’s connectors for Lucene or ElasticSearch work perfect for this
o Develop Golden Standard (positive and negative examples)
o Calculate matching scores based on structured criteria
Company Matching Sample Project
o We can matched 5+ big datasets within couple of months
o Fully automated procedure, which takes few hours to execute
 90% SPARQL and GraphDB’s FTS connectors
o One can get to about 85% F-Score with simple structural matching
rules with weights
o To get higher accuracy, you need:
 Massive amount of manual work and fine-tuning of weights … or
 Cognitive analytics (importance, similarity, highly accurate named entity recognition, etc.)
Presentation Outline
o Ontotext Introduction
o Cognitive Analytics Meet Big Knowledge Graphs
o Big Company Data: Knowing, Matching and Cleaning
o FactForge as a Playground
o RDFRank: Importance of Node in a Knowledge Graph
o Popularity Ranking of Companies
o Node Similarity in Knowledge Graphs
Presentation Outline
FactForge: Data Integration
o DBpedia (the English version) 496M
o Geonames (all geographic features on Earth) 150M
o owl:sameAs links between DBpedia and Geonames 471K
o GLEI (global company register data) 3M
o Panama Papers DB (#LinkedLeaks) 20M
o Other datasets and ontologies: WordNet, WorldFacts, FIBO
o News metadata (2000 articles/day enriched by NOW) 673M
o Total size (1.8B explicit + 328M inferred statements) 2 168М
News Metadata
o Metadata from Ontotext’s Dynamic Semantic Publishing platform
 News stream from Google
 Automatically generated as part of the NOW.ontotext.com semantic news showcase
o News stream from Google since Feb 2015, about 50k news/month
 Over 1M news articles at present
 ~70 tags (annotations) per news article
o Tags link text mentions of concepts to the knowledge graph
 Technically these are URIs for entities (people, organizations, locations, etc.) and key phrases
GraphDB Workbench: Class Instances & Hierarchy
#31
GraphDB Workbench: Class Relations
#32
Explore Your Data Graph
Navigate to Explore -> Visual graph. You see a search input to choose a
resource as a starting point for graph exploration.
#33
#34
Visualize the Graph of Sofia
#35
Visualize the Graph of Sofia
Visual Graph: Node details
#36
SPARQL Editing
o GDB WB integrates the YASGUI editor
o Automatic prefix addition (best practice: load prefixes.ttl)
o Autocompletion of URIs (classes, properties, entities…)
#37
FactForge Saved Queries
FactForge
query F04
#38
Presentation Outline
o Ontotext Introduction
o Cognitive Analytics Meet Big Knowledge Graphs
o Big Company Data: Knowing, Matching and Cleaning
o FactForge as a Playground
o RDFRank: Importance of Node in a Knowledge Graph
o Popularity Ranking of Companies
o Node Similarity in Knowledge Graphs
Presentation Outline
o The RDFRank plug-in of GraphDB™ computes PageRank importance
 It identifies “important” nodes in an RDF graph based on their interconnectedness
o Ranks accessible via rank:hasRDFRank predicate
o Incremental RDF Rank calculation is useful for dynamic data
o Example: to select the top 100 important nodes in the RDF graph:
RDFRank: Importance of Node in a Graph
PREFIX rank: http://www.ontotext.com/owlim/RDFRank#
SELECT * { ?n rank:hasRDFRank ?r }
ORDER BY DESC(?r)
LIMIT 100
RDFRank Plug-in Configuration and Usage
Presentation Outline
o Ontotext Introduction
o Cognitive Analytics Meet Big Knowledge Graphs
o Big Company Data: Knowing, Matching and Cleaning
o FactForge as a Playground
o RDFRank: Importance of Node in a Knowledge Graph
o Popularity Ranking of Companies
o Node Similarity in Knowledge Graphs
Presentation Outline
Semantic Media Monitoring
For each entity:
opopularity
trends
orelevant news
orelated
entities
oknowledge
graph
information
43
Semantic Media Monitoring/Press-Clipping
o We can trace references to specific company in the news
 This is pretty much standard, however we can deal with syntactic variations in the names,
because state of the art Named Entity Recognition technology is used
 What’s more important, we distinguish correctly in which mention “Paris” refers to which
of the following: Paris (the capital of France), Paris in Texas, Paris Hilton or to Paris (the
Greek hero)
o We can trace and consolidate references to daughter companies
o We have comprehensive industry classification
#44
Media Monitoring Queries
o F6: News that mention a specific entity during specific period
o F7: Most popular companies per industry
o F8: Most popular companies per industry, including children
#45
News Popularity Ranking of Companies
o Can be customized by industry, region and time period
o Unique features:
 It is based on live streaming news
 Tracks also mentions of subsidiaries
o Rank uses the industry sectors of DBPedia with several refinements
 About 40 top-industry sectors
 Sectors are linked in a hierarchical taxonomy (all together 251 sectors)
 Industry sectors are de-duplicated (all designators used in Wikipedia are about 9 000)
o Try rank.ontotext.com
#46
rank.ontotext.com Demonstrator
#47
rank.ontotext.com Demonstrator
#48
Presentation Outline
o Ontotext Introduction
o Cognitive Analytics Meet Big Knowledge Graphs
o Big Company Data: Knowing, Matching and Cleaning
o FactForge as a Playground
o RDFRank: Importance of Node in a Knowledge Graph
o Popularity Ranking of Companies
o Node Similarity in Knowledge Graphs
Presentation Outline
Why?
o I want to search in knowledge graphs!
 Because they are too big and I don’t have years to write “proper” SPARQL
o I want to detect duplicates
o I want to reify entities across sources
 For disambiguation of entity references in text
 For reconciliation or for linking or data integration
o Adapt Information Retrieval (IR) ideas
o IR’s basic is the Vector space model
 Documents are vectors in a geometrical space that has one dimension for each word
 Coordinates correspond to the number of times that the word appears in the document
 Imagine there are only 5 words used in all documents: a, ab, acb, ba, bc
 Document content: “a bc ba a abc”
 Document vector = <2,0,1,1,1>
 Similarity is measure by the COS of the angle between the vectors
 Very popular and simple. Read more at https://en.wikipedia.org/wiki/Vector_space_model
How It Can Work?
#51
o A simple model: Nodes are described by the outgoing edges
 I.e. PO-pairs <predicate,object> (key and value, relation type and target)
 Concern: there are too many different PO-pairs in a big KG
o For FactForge, it would be a space with 100M+ dimensions …
o Assumption: two nodes are more similar if they share more PO-pairs
o I.e. they are connected to the same other nodes with the same types of relationships
 Weighted sum: weight PO-pairs based on their “information value”
 More “popular” pairs are less informative; known as TF.IDF
o Experiment with node similarity in FactForge.net
Vector Space Model for Knowledge Graphs
#52
Example: Nodes Sharing Outgoing Links
id:IOEW2389W id:IOP235JKO
San Francisco
Aerospace
“BFR LLC”name
“Venus Express Inc.”
name
Beta Inc.
parent
shared edge
differentiating edge
“Venus Express Inc.”
#53
Demonstration
1. Shared edges for a node
2. Shared edges for a node, filtered by predicate
3. Similar nodes by count of shared edges
4. Similar nodes by count of shared edges, typed
5. Shared pairs for a node, by popularity
6. Node similarity as weighted sum of shared pairs
NOTE: Saved queries at FactForge.net (named SH*)
#54
rank.ontotext.com Demonstrator
o Experiment with better TF.IDF normalizations
o Take into consideration similarities between Ps and Os
 <locatedIn,Austin> and <hasOfficeIn,Texas> are not unrelated
o Combine with other context info: co-occurrence, RDFRank, …
o Optimize the computation of such similarity
Next Steps on Similarity in KG
Thank you!
Experience the technology with our demonstrators
NOW: Semantic News Portal http://now.ontotext.com
RANK: News popularity ranking for companies http://rank.ontotext.com
FactForge: Hub for open data and news about People and Organizations
http://factforge.net
#57

More Related Content

More from Ontotext

Transforming Your Data with GraphDB: GraphDB Fundamentals, Jan 2018
Transforming Your Data with GraphDB: GraphDB Fundamentals, Jan 2018Transforming Your Data with GraphDB: GraphDB Fundamentals, Jan 2018
Transforming Your Data with GraphDB: GraphDB Fundamentals, Jan 2018Ontotext
 
Hercule: Journalist Platform to Find Breaking News and Fight Fake Ones
Hercule: Journalist Platform to Find Breaking News and Fight Fake OnesHercule: Journalist Platform to Find Breaking News and Fight Fake Ones
Hercule: Journalist Platform to Find Breaking News and Fight Fake OnesOntotext
 
How to migrate to GraphDB in 10 easy to follow steps
How to migrate to GraphDB in 10 easy to follow steps How to migrate to GraphDB in 10 easy to follow steps
How to migrate to GraphDB in 10 easy to follow steps Ontotext
 
GraphDB Cloud: Enterprise Ready RDF Database on Demand
GraphDB Cloud: Enterprise Ready RDF Database on DemandGraphDB Cloud: Enterprise Ready RDF Database on Demand
GraphDB Cloud: Enterprise Ready RDF Database on DemandOntotext
 
[Webinar] FactForge Debuts: Trump World Data and Instant Ranking of Industry ...
[Webinar] FactForge Debuts: Trump World Data and Instant Ranking of Industry ...[Webinar] FactForge Debuts: Trump World Data and Instant Ranking of Industry ...
[Webinar] FactForge Debuts: Trump World Data and Instant Ranking of Industry ...Ontotext
 
Smarter content with a Dynamic Semantic Publishing Platform
Smarter content with a Dynamic Semantic Publishing PlatformSmarter content with a Dynamic Semantic Publishing Platform
Smarter content with a Dynamic Semantic Publishing PlatformOntotext
 
How is smart data cooked?
How is smart data cooked?How is smart data cooked?
How is smart data cooked?Ontotext
 
Efficient Practices for Large Scale Text Mining Process
Efficient Practices for Large Scale Text Mining ProcessEfficient Practices for Large Scale Text Mining Process
Efficient Practices for Large Scale Text Mining ProcessOntotext
 
The Power of Semantic Technologies to Explore Linked Open Data
The Power of Semantic Technologies to Explore Linked Open DataThe Power of Semantic Technologies to Explore Linked Open Data
The Power of Semantic Technologies to Explore Linked Open DataOntotext
 
First Steps in Semantic Data Modelling and Search & Analytics in the Cloud
First Steps in Semantic Data Modelling and Search & Analytics in the CloudFirst Steps in Semantic Data Modelling and Search & Analytics in the Cloud
First Steps in Semantic Data Modelling and Search & Analytics in the CloudOntotext
 
The Knowledge Discovery Quest
The Knowledge Discovery Quest The Knowledge Discovery Quest
The Knowledge Discovery Quest Ontotext
 
Best Practices for Large Scale Text Mining Processing
Best Practices for Large Scale Text Mining ProcessingBest Practices for Large Scale Text Mining Processing
Best Practices for Large Scale Text Mining ProcessingOntotext
 
Build Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage
Build Narratives, Connect Artifacts: Linked Open Data for Cultural HeritageBuild Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage
Build Narratives, Connect Artifacts: Linked Open Data for Cultural HeritageOntotext
 
Semantic Data Normalization For Efficient Clinical Trial Research
Semantic Data Normalization For Efficient Clinical Trial ResearchSemantic Data Normalization For Efficient Clinical Trial Research
Semantic Data Normalization For Efficient Clinical Trial ResearchOntotext
 
Gain Super Powers in Data Science: Relationship Discovery Across Public Data
Gain Super Powers in Data Science: Relationship Discovery Across Public DataGain Super Powers in Data Science: Relationship Discovery Across Public Data
Gain Super Powers in Data Science: Relationship Discovery Across Public DataOntotext
 
Gaining Advantage in e-Learning with Semantic Adaptive Technology
Gaining Advantage in e-Learning with Semantic Adaptive TechnologyGaining Advantage in e-Learning with Semantic Adaptive Technology
Gaining Advantage in e-Learning with Semantic Adaptive TechnologyOntotext
 
Cooking up the Semantic Web
Cooking up the Semantic WebCooking up the Semantic Web
Cooking up the Semantic WebOntotext
 
Diving in Panama Papers and Open Data to Discover Emerging News
Diving in Panama Papers and Open Data to Discover Emerging NewsDiving in Panama Papers and Open Data to Discover Emerging News
Diving in Panama Papers and Open Data to Discover Emerging NewsOntotext
 
How to Reveal Hidden Relationships in Data and Risk Analytics
How to Reveal Hidden Relationships in Data and Risk AnalyticsHow to Reveal Hidden Relationships in Data and Risk Analytics
How to Reveal Hidden Relationships in Data and Risk AnalyticsOntotext
 
Why Semantics Matter? Adding the semantic edge to your content, right from au...
Why Semantics Matter? Adding the semantic edge to your content,right from au...Why Semantics Matter? Adding the semantic edge to your content,right from au...
Why Semantics Matter? Adding the semantic edge to your content, right from au...Ontotext
 

More from Ontotext (20)

Transforming Your Data with GraphDB: GraphDB Fundamentals, Jan 2018
Transforming Your Data with GraphDB: GraphDB Fundamentals, Jan 2018Transforming Your Data with GraphDB: GraphDB Fundamentals, Jan 2018
Transforming Your Data with GraphDB: GraphDB Fundamentals, Jan 2018
 
Hercule: Journalist Platform to Find Breaking News and Fight Fake Ones
Hercule: Journalist Platform to Find Breaking News and Fight Fake OnesHercule: Journalist Platform to Find Breaking News and Fight Fake Ones
Hercule: Journalist Platform to Find Breaking News and Fight Fake Ones
 
How to migrate to GraphDB in 10 easy to follow steps
How to migrate to GraphDB in 10 easy to follow steps How to migrate to GraphDB in 10 easy to follow steps
How to migrate to GraphDB in 10 easy to follow steps
 
GraphDB Cloud: Enterprise Ready RDF Database on Demand
GraphDB Cloud: Enterprise Ready RDF Database on DemandGraphDB Cloud: Enterprise Ready RDF Database on Demand
GraphDB Cloud: Enterprise Ready RDF Database on Demand
 
[Webinar] FactForge Debuts: Trump World Data and Instant Ranking of Industry ...
[Webinar] FactForge Debuts: Trump World Data and Instant Ranking of Industry ...[Webinar] FactForge Debuts: Trump World Data and Instant Ranking of Industry ...
[Webinar] FactForge Debuts: Trump World Data and Instant Ranking of Industry ...
 
Smarter content with a Dynamic Semantic Publishing Platform
Smarter content with a Dynamic Semantic Publishing PlatformSmarter content with a Dynamic Semantic Publishing Platform
Smarter content with a Dynamic Semantic Publishing Platform
 
How is smart data cooked?
How is smart data cooked?How is smart data cooked?
How is smart data cooked?
 
Efficient Practices for Large Scale Text Mining Process
Efficient Practices for Large Scale Text Mining ProcessEfficient Practices for Large Scale Text Mining Process
Efficient Practices for Large Scale Text Mining Process
 
The Power of Semantic Technologies to Explore Linked Open Data
The Power of Semantic Technologies to Explore Linked Open DataThe Power of Semantic Technologies to Explore Linked Open Data
The Power of Semantic Technologies to Explore Linked Open Data
 
First Steps in Semantic Data Modelling and Search & Analytics in the Cloud
First Steps in Semantic Data Modelling and Search & Analytics in the CloudFirst Steps in Semantic Data Modelling and Search & Analytics in the Cloud
First Steps in Semantic Data Modelling and Search & Analytics in the Cloud
 
The Knowledge Discovery Quest
The Knowledge Discovery Quest The Knowledge Discovery Quest
The Knowledge Discovery Quest
 
Best Practices for Large Scale Text Mining Processing
Best Practices for Large Scale Text Mining ProcessingBest Practices for Large Scale Text Mining Processing
Best Practices for Large Scale Text Mining Processing
 
Build Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage
Build Narratives, Connect Artifacts: Linked Open Data for Cultural HeritageBuild Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage
Build Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage
 
Semantic Data Normalization For Efficient Clinical Trial Research
Semantic Data Normalization For Efficient Clinical Trial ResearchSemantic Data Normalization For Efficient Clinical Trial Research
Semantic Data Normalization For Efficient Clinical Trial Research
 
Gain Super Powers in Data Science: Relationship Discovery Across Public Data
Gain Super Powers in Data Science: Relationship Discovery Across Public DataGain Super Powers in Data Science: Relationship Discovery Across Public Data
Gain Super Powers in Data Science: Relationship Discovery Across Public Data
 
Gaining Advantage in e-Learning with Semantic Adaptive Technology
Gaining Advantage in e-Learning with Semantic Adaptive TechnologyGaining Advantage in e-Learning with Semantic Adaptive Technology
Gaining Advantage in e-Learning with Semantic Adaptive Technology
 
Cooking up the Semantic Web
Cooking up the Semantic WebCooking up the Semantic Web
Cooking up the Semantic Web
 
Diving in Panama Papers and Open Data to Discover Emerging News
Diving in Panama Papers and Open Data to Discover Emerging NewsDiving in Panama Papers and Open Data to Discover Emerging News
Diving in Panama Papers and Open Data to Discover Emerging News
 
How to Reveal Hidden Relationships in Data and Risk Analytics
How to Reveal Hidden Relationships in Data and Risk AnalyticsHow to Reveal Hidden Relationships in Data and Risk Analytics
How to Reveal Hidden Relationships in Data and Risk Analytics
 
Why Semantics Matter? Adding the semantic edge to your content, right from au...
Why Semantics Matter? Adding the semantic edge to your content,right from au...Why Semantics Matter? Adding the semantic edge to your content,right from au...
Why Semantics Matter? Adding the semantic edge to your content, right from au...
 

Recently uploaded

办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样umasea
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024StefanoLambiase
 
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfStefano Stabellini
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...OnePlan Solutions
 
How to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfHow to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfLivetecs LLC
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanyChristoph Pohl
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityNeo4j
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEEVICTOR MAESTRE RAMIREZ
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....kzayra69
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaHanief Utama
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作qr0udbr0
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)jennyeacort
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfFerryKemperman
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...OnePlan Solutions
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Cizo Technology Services
 
Best Web Development Agency- Idiosys USA.pdf
Best Web Development Agency- Idiosys USA.pdfBest Web Development Agency- Idiosys USA.pdf
Best Web Development Agency- Idiosys USA.pdfIdiosysTechnologies1
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentationvaddepallysandeep122
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...Technogeeks
 

Recently uploaded (20)

办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdf
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
How to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfHow to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdf
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered Sustainability
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
Best Web Development Agency- Idiosys USA.pdf
Best Web Development Agency- Idiosys USA.pdfBest Web Development Agency- Idiosys USA.pdf
Best Web Development Agency- Idiosys USA.pdf
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentation
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 

[Webinar] Graph Analytics on Company Data and News

  • 1. Cognitive Graph Analytics on Company Data and News Webinar, Mar 2018 Popularity ranking, importance and similarity
  • 2. Presentation Outline o Ontotext Introduction o Cognitive Analytics Meet Big Knowledge Graphs o Big Company Data: Knowing, Matching and Cleaning o FactForge as a Playground o RDFRank: Importance of Node in a Knowledge Graph o Popularity Ranking of Companies o Node Similarity in Knowledge Graphs Presentation Outline
  • 3. Technology Excellence Delivered o Unique technology mix: GraphDBTM engine + Text mining o Robust technology: We power BBC.CO.UK/SPORT and FT.COM o We serve many of the most knowledge intensive enterprises: FT Financial Times Houses of Parliament Information & Communications Technology
  • 4. Linking Text to Knowledge Graphs 1. Integrate relevant data from many sources  Build a Big Knowledge Graph from proprietary databases and taxonomies, integrated with millions of facts of Linked Data 2. Infer new facts and unveil relationships  Performing reasoning across data from different sources 3. Link text reference to a big Knowledge Graphs  Using text-mining to automatically discover references to concepts and entities 4. Use GraphDB for metadata management, querying and search
  • 5.
  • 6. Presentation Outline o Ontotext Introduction o Cognitive Analytics Meet Big Knowledge Graphs o Big Company Data: Knowing, Matching and Cleaning o FactForge as a Playground o RDFRank: Importance of Node in a Knowledge Graph o Popularity Ranking of Companies o Node Similarity in Knowledge Graphs Presentation Outline
  • 7. Cognitive Systems: Requirements Enterprise cognitive systems should: 1. Know the key concepts and entities in the domain 2. Be able to recognize in text mentions of entities 3. Be able to reconciliate entities across data sources and systems 4. Combine global intelligence with proprietary data and perspective 5. Evolve and learn, without cognitive expert engineering
  • 8. Cognitive Systems: Motivation Adequate business decisions require global information!  Analytics cannot deliver deep insights only based on proprietary data  Broader context and signals are needed
  • 9. Cognitive Systems: Motivation Developing from scratch cognitive system with global knowledge is unfeasible for most of the enterprises  We don’t grow and educate people tailor-made to serve single enterprise, right?
  • 10. Cognitive Reading Machines o Knowledge and awareness  Factual information about concepts, entities, events , … o Reading capabilities:  Lexical and grammatical knowledge to comprehend text  Recognize concepts, entities and events o Resolving ambiguity: whether “Paris” refers to "Paris, France“, “Paris, TX” or "Paris Hilton" o Learning capabilities: learn new facts from text  Read and analyze information in order to extend their knowledge o Get better entity profiles and relations  Get better in recognizing concepts and entities
  • 11. Context and Awareness o Context allows concepts to be identified, the way people do o Sufficiently big knowledge graph provides context for the entity  Differentiating features and similar nodes  How important and how popular it is  Related entities and concepts  Entities it is typically mentioned together with (co-occurrence) o This is awareness! o The kind of knowledge that people mean saying "I am aware of X" or "She is cognizant of Y"
  • 12. The Critical Mass Malcolm Gladwell claims that one needs to devote 10 000 hours to become an expert in something, e.g. violin or hokey (Outliers)
  • 13. The Critical Mass o A cognitive system needs:  To know 1B facts  About 100M concepts and entities  Read 1M news articles o In order to reach concept and entity awareness in a specific domain  The level of awareness that people mean saying “My background is X”
  • 14. Let’s play an Awareness game! o Important airports near London? o The most popular banks in US? o Companies similar to Google? o People mentioned together with IBM in news?
  • 15. We are getting closer! o Our Business Knowledge Model can already answer many of these questions better than you o Most of this intelligence is available in the Ontotext Platform o Knowledge model = KG + text mining + analytics o We already offer two such knowledge models:  Business and general news: one for processing general business master data (like people, organizations, locations and their mentions in the news)  Life sciences and healthcare
  • 16. Presentation Outline o Ontotext Introduction o Cognitive Analytics Meet Big Knowledge Graphs o Big Company Data: Knowing, Matching and Cleaning o FactForge as a Playground o RDFRank: Importance of Node in a Knowledge Graph o Popularity Ranking of Companies o Node Similarity in Knowledge Graphs Presentation Outline
  • 17. o POL data is the most common type of master/reference data  Considering business applications and news o Open POL data is available in vast quantities  Geonames covers locations exhaustively; DBPedia covers well popular POL entities; Wikidata, …  Open company data grows: OpenCorporates, GLEI, open national registers, various “data leaks” o Within 3 years exhaustive global POL data will be commodity!  And they will be widely used for BI and decision making o Ontotext delivers Global POL data solutions today.  We aim to make them more affordable with more cognitive analytics Person, Organization, Location (POL) Data
  • 18. Oct 2016 Company Data Species (1/2) Category Representatives Size (Orgs.) Exhaustive Global Databases Dun & Bradstreet, BvD, Factset > 200M Rich Company Databases Capital IQ (S&P), Thomson Reuters (various) 5-10M Investment Databases CrunchBase, PitchBook, CBI, DJ Venture Source 200-600K Very Big Open Databases OpenCorporates 130M Global Official Open Databases GLEI (Global Legal Identifier), EU BRIS 1-30M Open Encyclopedic DBPedia, Wikidata 0.3-1.2M Open Leaks and Investigations Panama Papers (Offshore Leaks), Trump World Data 3-300K
  • 19. Oct 2016 Company Data Species (2/2) Category Loca tions Industry Classi- fication High Tech. Fields Invest. Info Org-Org Relations (e.g.Tree) Org- Person Relations Clean, Correct, Predictable Exhaustive Global Databases ++ +/- - - ++ +/- 6 Rich Company Databases ++ + +/- +/- ++ +/- 8 Investment Databases +/- +/- + + ++/- +/- 4-6 Very Big Open Databases + +/- - - +/- - 8 Global Official Open Databases + - - - +/- - 8 Open Encyclopedic +/- +/- + - +/- + 3-5 Open Leaks and Investigations +/- - - - +/- + 4-6
  • 20. o Integrate European company and economic data o euBusinessGraph will overcome barriers in company data provisioning o Technology and research partners: SINTEF (coord.), Ontotext, IJS, Uni. Milano
  • 21. Matching and Overlap o Organizations matched across: CrunchBase (CB), CB Insights (CBI), Capital IQ (CIQ), DJ Venture Source, … o The Venn diagram presents the overlap between sources  The size of the circle indicates number of entities per source  The level of overlap indicates number of entities matched between the two sources
  • 22. Oct 2016 Slice and Dice by Data Source, Industry and High-tech Field
  • 25. Entity Matching Across Datasets o Match IDs of one of the same real entity across different databases o Data Challenges  Different schemata  Name variations  Different classifications and codes  Lack of unique identifiers (even ticker symbols are not unique) o Technology challenges  Pre-selection is needed; brute-force matching is not good for 1M against 5M companies  It is not trivial to come up with good pre-selection mechanism
  • 26. Company Matching Methodology o Load all datasets in a single knowledge graph  Schema mapping: RDF-ize all datasets using an unified ontology o Map the industry classifications used across the sources o Map all locations to Geonames o Pre-selection: use FTS on names and descriptions  GraphDB’s connectors for Lucene or ElasticSearch work perfect for this o Develop Golden Standard (positive and negative examples) o Calculate matching scores based on structured criteria
  • 27. Company Matching Sample Project o We can matched 5+ big datasets within couple of months o Fully automated procedure, which takes few hours to execute  90% SPARQL and GraphDB’s FTS connectors o One can get to about 85% F-Score with simple structural matching rules with weights o To get higher accuracy, you need:  Massive amount of manual work and fine-tuning of weights … or  Cognitive analytics (importance, similarity, highly accurate named entity recognition, etc.)
  • 28. Presentation Outline o Ontotext Introduction o Cognitive Analytics Meet Big Knowledge Graphs o Big Company Data: Knowing, Matching and Cleaning o FactForge as a Playground o RDFRank: Importance of Node in a Knowledge Graph o Popularity Ranking of Companies o Node Similarity in Knowledge Graphs Presentation Outline
  • 29. FactForge: Data Integration o DBpedia (the English version) 496M o Geonames (all geographic features on Earth) 150M o owl:sameAs links between DBpedia and Geonames 471K o GLEI (global company register data) 3M o Panama Papers DB (#LinkedLeaks) 20M o Other datasets and ontologies: WordNet, WorldFacts, FIBO o News metadata (2000 articles/day enriched by NOW) 673M o Total size (1.8B explicit + 328M inferred statements) 2 168М
  • 30. News Metadata o Metadata from Ontotext’s Dynamic Semantic Publishing platform  News stream from Google  Automatically generated as part of the NOW.ontotext.com semantic news showcase o News stream from Google since Feb 2015, about 50k news/month  Over 1M news articles at present  ~70 tags (annotations) per news article o Tags link text mentions of concepts to the knowledge graph  Technically these are URIs for entities (people, organizations, locations, etc.) and key phrases
  • 31. GraphDB Workbench: Class Instances & Hierarchy #31
  • 32. GraphDB Workbench: Class Relations #32
  • 33. Explore Your Data Graph Navigate to Explore -> Visual graph. You see a search input to choose a resource as a starting point for graph exploration. #33
  • 36. Visual Graph: Node details #36
  • 37. SPARQL Editing o GDB WB integrates the YASGUI editor o Automatic prefix addition (best practice: load prefixes.ttl) o Autocompletion of URIs (classes, properties, entities…) #37
  • 39. Presentation Outline o Ontotext Introduction o Cognitive Analytics Meet Big Knowledge Graphs o Big Company Data: Knowing, Matching and Cleaning o FactForge as a Playground o RDFRank: Importance of Node in a Knowledge Graph o Popularity Ranking of Companies o Node Similarity in Knowledge Graphs Presentation Outline
  • 40. o The RDFRank plug-in of GraphDB™ computes PageRank importance  It identifies “important” nodes in an RDF graph based on their interconnectedness o Ranks accessible via rank:hasRDFRank predicate o Incremental RDF Rank calculation is useful for dynamic data o Example: to select the top 100 important nodes in the RDF graph: RDFRank: Importance of Node in a Graph PREFIX rank: http://www.ontotext.com/owlim/RDFRank# SELECT * { ?n rank:hasRDFRank ?r } ORDER BY DESC(?r) LIMIT 100
  • 42. Presentation Outline o Ontotext Introduction o Cognitive Analytics Meet Big Knowledge Graphs o Big Company Data: Knowing, Matching and Cleaning o FactForge as a Playground o RDFRank: Importance of Node in a Knowledge Graph o Popularity Ranking of Companies o Node Similarity in Knowledge Graphs Presentation Outline
  • 43. Semantic Media Monitoring For each entity: opopularity trends orelevant news orelated entities oknowledge graph information 43
  • 44. Semantic Media Monitoring/Press-Clipping o We can trace references to specific company in the news  This is pretty much standard, however we can deal with syntactic variations in the names, because state of the art Named Entity Recognition technology is used  What’s more important, we distinguish correctly in which mention “Paris” refers to which of the following: Paris (the capital of France), Paris in Texas, Paris Hilton or to Paris (the Greek hero) o We can trace and consolidate references to daughter companies o We have comprehensive industry classification #44
  • 45. Media Monitoring Queries o F6: News that mention a specific entity during specific period o F7: Most popular companies per industry o F8: Most popular companies per industry, including children #45
  • 46. News Popularity Ranking of Companies o Can be customized by industry, region and time period o Unique features:  It is based on live streaming news  Tracks also mentions of subsidiaries o Rank uses the industry sectors of DBPedia with several refinements  About 40 top-industry sectors  Sectors are linked in a hierarchical taxonomy (all together 251 sectors)  Industry sectors are de-duplicated (all designators used in Wikipedia are about 9 000) o Try rank.ontotext.com #46
  • 49. Presentation Outline o Ontotext Introduction o Cognitive Analytics Meet Big Knowledge Graphs o Big Company Data: Knowing, Matching and Cleaning o FactForge as a Playground o RDFRank: Importance of Node in a Knowledge Graph o Popularity Ranking of Companies o Node Similarity in Knowledge Graphs Presentation Outline
  • 50. Why? o I want to search in knowledge graphs!  Because they are too big and I don’t have years to write “proper” SPARQL o I want to detect duplicates o I want to reify entities across sources  For disambiguation of entity references in text  For reconciliation or for linking or data integration
  • 51. o Adapt Information Retrieval (IR) ideas o IR’s basic is the Vector space model  Documents are vectors in a geometrical space that has one dimension for each word  Coordinates correspond to the number of times that the word appears in the document  Imagine there are only 5 words used in all documents: a, ab, acb, ba, bc  Document content: “a bc ba a abc”  Document vector = <2,0,1,1,1>  Similarity is measure by the COS of the angle between the vectors  Very popular and simple. Read more at https://en.wikipedia.org/wiki/Vector_space_model How It Can Work? #51
  • 52. o A simple model: Nodes are described by the outgoing edges  I.e. PO-pairs <predicate,object> (key and value, relation type and target)  Concern: there are too many different PO-pairs in a big KG o For FactForge, it would be a space with 100M+ dimensions … o Assumption: two nodes are more similar if they share more PO-pairs o I.e. they are connected to the same other nodes with the same types of relationships  Weighted sum: weight PO-pairs based on their “information value”  More “popular” pairs are less informative; known as TF.IDF o Experiment with node similarity in FactForge.net Vector Space Model for Knowledge Graphs #52
  • 53. Example: Nodes Sharing Outgoing Links id:IOEW2389W id:IOP235JKO San Francisco Aerospace “BFR LLC”name “Venus Express Inc.” name Beta Inc. parent shared edge differentiating edge “Venus Express Inc.” #53
  • 54. Demonstration 1. Shared edges for a node 2. Shared edges for a node, filtered by predicate 3. Similar nodes by count of shared edges 4. Similar nodes by count of shared edges, typed 5. Shared pairs for a node, by popularity 6. Node similarity as weighted sum of shared pairs NOTE: Saved queries at FactForge.net (named SH*) #54
  • 56. o Experiment with better TF.IDF normalizations o Take into consideration similarities between Ps and Os  <locatedIn,Austin> and <hasOfficeIn,Texas> are not unrelated o Combine with other context info: co-occurrence, RDFRank, … o Optimize the computation of such similarity Next Steps on Similarity in KG
  • 57. Thank you! Experience the technology with our demonstrators NOW: Semantic News Portal http://now.ontotext.com RANK: News popularity ranking for companies http://rank.ontotext.com FactForge: Hub for open data and news about People and Organizations http://factforge.net #57