In this live webinar Atanas Kiryakov, CEO of Ontotext demonstrated the power of Graph Analytics to create links between various datasets and lead to knowledge discovery. He also shared Ontotext’s comparison of different sources of company data, how you can combine and use them, what information you can get from each of those with what coverage, etc.
What is Advanced Excel and what are some best practices for designing and cre...
[Webinar] Graph Analytics on Company Data and News
1. Cognitive Graph Analytics
on Company Data and News
Webinar, Mar 2018
Popularity ranking, importance and similarity
2. Presentation Outline
o Ontotext Introduction
o Cognitive Analytics Meet Big Knowledge Graphs
o Big Company Data: Knowing, Matching and Cleaning
o FactForge as a Playground
o RDFRank: Importance of Node in a Knowledge Graph
o Popularity Ranking of Companies
o Node Similarity in Knowledge Graphs
Presentation Outline
3. Technology Excellence Delivered
o Unique technology mix: GraphDBTM engine + Text mining
o Robust technology: We power BBC.CO.UK/SPORT and FT.COM
o We serve many of the most knowledge intensive enterprises:
FT
Financial
Times
Houses of Parliament
Information & Communications Technology
4. Linking Text to Knowledge Graphs
1. Integrate relevant data from many sources
Build a Big Knowledge Graph from proprietary databases and
taxonomies, integrated with millions of facts of Linked Data
2. Infer new facts and unveil relationships
Performing reasoning across data from different sources
3. Link text reference to a big Knowledge Graphs
Using text-mining to automatically discover references to concepts and entities
4. Use GraphDB for metadata management, querying and search
5.
6. Presentation Outline
o Ontotext Introduction
o Cognitive Analytics Meet Big Knowledge Graphs
o Big Company Data: Knowing, Matching and Cleaning
o FactForge as a Playground
o RDFRank: Importance of Node in a Knowledge Graph
o Popularity Ranking of Companies
o Node Similarity in Knowledge Graphs
Presentation Outline
7. Cognitive Systems: Requirements
Enterprise cognitive systems should:
1. Know the key concepts and entities in the domain
2. Be able to recognize in text mentions of entities
3. Be able to reconciliate entities across data sources and systems
4. Combine global intelligence with proprietary data and perspective
5. Evolve and learn, without cognitive expert engineering
8. Cognitive Systems: Motivation
Adequate business decisions require global information!
Analytics cannot deliver deep insights only based on proprietary data
Broader context and signals are needed
9. Cognitive Systems: Motivation
Developing from scratch cognitive
system with global knowledge is
unfeasible for most of the enterprises
We don’t grow and educate people tailor-made to
serve single enterprise, right?
10. Cognitive Reading Machines
o Knowledge and awareness
Factual information about concepts, entities, events , …
o Reading capabilities:
Lexical and grammatical knowledge to comprehend text
Recognize concepts, entities and events
o Resolving ambiguity: whether “Paris” refers to "Paris, France“, “Paris, TX” or "Paris Hilton"
o Learning capabilities: learn new facts from text
Read and analyze information in order to extend their knowledge
o Get better entity profiles and relations
Get better in recognizing concepts and entities
11. Context and Awareness
o Context allows concepts to be identified, the way people do
o Sufficiently big knowledge graph provides context for the entity
Differentiating features and similar nodes
How important and how popular it is
Related entities and concepts
Entities it is typically mentioned together with (co-occurrence)
o This is awareness!
o The kind of knowledge that people mean saying "I am aware of X"
or "She is cognizant of Y"
12. The Critical Mass
Malcolm Gladwell claims that one needs
to devote 10 000 hours to become an
expert in something, e.g. violin or hokey
(Outliers)
13. The Critical Mass
o A cognitive system needs:
To know 1B facts
About 100M concepts and entities
Read 1M news articles
o In order to reach concept and entity
awareness in a specific domain
The level of awareness that people mean saying
“My background is X”
14. Let’s play an Awareness game!
o Important airports near London?
o The most popular banks in US?
o Companies similar to Google?
o People mentioned together with IBM in news?
15. We are getting closer!
o Our Business Knowledge Model can already answer many of
these questions better than you
o Most of this intelligence is available in the Ontotext Platform
o Knowledge model = KG + text mining + analytics
o We already offer two such knowledge models:
Business and general news: one for processing general business master data (like people,
organizations, locations and their mentions in the news)
Life sciences and healthcare
16. Presentation Outline
o Ontotext Introduction
o Cognitive Analytics Meet Big Knowledge Graphs
o Big Company Data: Knowing, Matching and Cleaning
o FactForge as a Playground
o RDFRank: Importance of Node in a Knowledge Graph
o Popularity Ranking of Companies
o Node Similarity in Knowledge Graphs
Presentation Outline
17. o POL data is the most common type of master/reference data
Considering business applications and news
o Open POL data is available in vast quantities
Geonames covers locations exhaustively; DBPedia covers well popular POL entities; Wikidata, …
Open company data grows: OpenCorporates, GLEI, open national registers, various “data leaks”
o Within 3 years exhaustive global POL data will be commodity!
And they will be widely used for BI and decision making
o Ontotext delivers Global POL data solutions today.
We aim to make them more affordable with more cognitive analytics
Person, Organization, Location (POL) Data
18. Oct
2016
Company Data Species (1/2)
Category Representatives Size (Orgs.)
Exhaustive Global Databases Dun & Bradstreet, BvD, Factset > 200M
Rich Company Databases Capital IQ (S&P), Thomson Reuters (various) 5-10M
Investment Databases CrunchBase, PitchBook, CBI, DJ Venture Source 200-600K
Very Big Open Databases OpenCorporates 130M
Global Official Open Databases GLEI (Global Legal Identifier), EU BRIS 1-30M
Open Encyclopedic DBPedia, Wikidata 0.3-1.2M
Open Leaks and Investigations Panama Papers (Offshore Leaks), Trump World Data 3-300K
19. Oct
2016
Company Data Species (2/2)
Category Loca
tions
Industry
Classi-
fication
High
Tech.
Fields
Invest.
Info
Org-Org
Relations
(e.g.Tree)
Org-
Person
Relations
Clean,
Correct,
Predictable
Exhaustive Global Databases ++ +/- - - ++ +/- 6
Rich Company Databases ++ + +/- +/- ++ +/- 8
Investment Databases +/- +/- + + ++/- +/- 4-6
Very Big Open Databases + +/- - - +/- - 8
Global Official Open Databases + - - - +/- - 8
Open Encyclopedic +/- +/- + - +/- + 3-5
Open Leaks and Investigations +/- - - - +/- + 4-6
20. o Integrate European company and economic data
o euBusinessGraph will overcome barriers in company data provisioning
o Technology and research partners: SINTEF (coord.), Ontotext, IJS, Uni. Milano
21. Matching and Overlap
o Organizations matched across:
CrunchBase (CB), CB Insights
(CBI), Capital IQ (CIQ), DJ
Venture Source, …
o The Venn diagram presents
the overlap between sources
The size of the circle indicates
number of entities per source
The level of overlap indicates number
of entities matched between the two
sources
25. Entity Matching Across Datasets
o Match IDs of one of the same real entity across different databases
o Data Challenges
Different schemata
Name variations
Different classifications and codes
Lack of unique identifiers (even ticker symbols are not unique)
o Technology challenges
Pre-selection is needed; brute-force matching is not good for 1M against 5M companies
It is not trivial to come up with good pre-selection mechanism
26. Company Matching Methodology
o Load all datasets in a single knowledge graph
Schema mapping: RDF-ize all datasets using an unified ontology
o Map the industry classifications used across the sources
o Map all locations to Geonames
o Pre-selection: use FTS on names and descriptions
GraphDB’s connectors for Lucene or ElasticSearch work perfect for this
o Develop Golden Standard (positive and negative examples)
o Calculate matching scores based on structured criteria
27. Company Matching Sample Project
o We can matched 5+ big datasets within couple of months
o Fully automated procedure, which takes few hours to execute
90% SPARQL and GraphDB’s FTS connectors
o One can get to about 85% F-Score with simple structural matching
rules with weights
o To get higher accuracy, you need:
Massive amount of manual work and fine-tuning of weights … or
Cognitive analytics (importance, similarity, highly accurate named entity recognition, etc.)
28. Presentation Outline
o Ontotext Introduction
o Cognitive Analytics Meet Big Knowledge Graphs
o Big Company Data: Knowing, Matching and Cleaning
o FactForge as a Playground
o RDFRank: Importance of Node in a Knowledge Graph
o Popularity Ranking of Companies
o Node Similarity in Knowledge Graphs
Presentation Outline
29. FactForge: Data Integration
o DBpedia (the English version) 496M
o Geonames (all geographic features on Earth) 150M
o owl:sameAs links between DBpedia and Geonames 471K
o GLEI (global company register data) 3M
o Panama Papers DB (#LinkedLeaks) 20M
o Other datasets and ontologies: WordNet, WorldFacts, FIBO
o News metadata (2000 articles/day enriched by NOW) 673M
o Total size (1.8B explicit + 328M inferred statements) 2 168М
30. News Metadata
o Metadata from Ontotext’s Dynamic Semantic Publishing platform
News stream from Google
Automatically generated as part of the NOW.ontotext.com semantic news showcase
o News stream from Google since Feb 2015, about 50k news/month
Over 1M news articles at present
~70 tags (annotations) per news article
o Tags link text mentions of concepts to the knowledge graph
Technically these are URIs for entities (people, organizations, locations, etc.) and key phrases
33. Explore Your Data Graph
Navigate to Explore -> Visual graph. You see a search input to choose a
resource as a starting point for graph exploration.
#33
39. Presentation Outline
o Ontotext Introduction
o Cognitive Analytics Meet Big Knowledge Graphs
o Big Company Data: Knowing, Matching and Cleaning
o FactForge as a Playground
o RDFRank: Importance of Node in a Knowledge Graph
o Popularity Ranking of Companies
o Node Similarity in Knowledge Graphs
Presentation Outline
40. o The RDFRank plug-in of GraphDB™ computes PageRank importance
It identifies “important” nodes in an RDF graph based on their interconnectedness
o Ranks accessible via rank:hasRDFRank predicate
o Incremental RDF Rank calculation is useful for dynamic data
o Example: to select the top 100 important nodes in the RDF graph:
RDFRank: Importance of Node in a Graph
PREFIX rank: http://www.ontotext.com/owlim/RDFRank#
SELECT * { ?n rank:hasRDFRank ?r }
ORDER BY DESC(?r)
LIMIT 100
42. Presentation Outline
o Ontotext Introduction
o Cognitive Analytics Meet Big Knowledge Graphs
o Big Company Data: Knowing, Matching and Cleaning
o FactForge as a Playground
o RDFRank: Importance of Node in a Knowledge Graph
o Popularity Ranking of Companies
o Node Similarity in Knowledge Graphs
Presentation Outline
43. Semantic Media Monitoring
For each entity:
opopularity
trends
orelevant news
orelated
entities
oknowledge
graph
information
43
44. Semantic Media Monitoring/Press-Clipping
o We can trace references to specific company in the news
This is pretty much standard, however we can deal with syntactic variations in the names,
because state of the art Named Entity Recognition technology is used
What’s more important, we distinguish correctly in which mention “Paris” refers to which
of the following: Paris (the capital of France), Paris in Texas, Paris Hilton or to Paris (the
Greek hero)
o We can trace and consolidate references to daughter companies
o We have comprehensive industry classification
#44
45. Media Monitoring Queries
o F6: News that mention a specific entity during specific period
o F7: Most popular companies per industry
o F8: Most popular companies per industry, including children
#45
46. News Popularity Ranking of Companies
o Can be customized by industry, region and time period
o Unique features:
It is based on live streaming news
Tracks also mentions of subsidiaries
o Rank uses the industry sectors of DBPedia with several refinements
About 40 top-industry sectors
Sectors are linked in a hierarchical taxonomy (all together 251 sectors)
Industry sectors are de-duplicated (all designators used in Wikipedia are about 9 000)
o Try rank.ontotext.com
#46
49. Presentation Outline
o Ontotext Introduction
o Cognitive Analytics Meet Big Knowledge Graphs
o Big Company Data: Knowing, Matching and Cleaning
o FactForge as a Playground
o RDFRank: Importance of Node in a Knowledge Graph
o Popularity Ranking of Companies
o Node Similarity in Knowledge Graphs
Presentation Outline
50. Why?
o I want to search in knowledge graphs!
Because they are too big and I don’t have years to write “proper” SPARQL
o I want to detect duplicates
o I want to reify entities across sources
For disambiguation of entity references in text
For reconciliation or for linking or data integration
51. o Adapt Information Retrieval (IR) ideas
o IR’s basic is the Vector space model
Documents are vectors in a geometrical space that has one dimension for each word
Coordinates correspond to the number of times that the word appears in the document
Imagine there are only 5 words used in all documents: a, ab, acb, ba, bc
Document content: “a bc ba a abc”
Document vector = <2,0,1,1,1>
Similarity is measure by the COS of the angle between the vectors
Very popular and simple. Read more at https://en.wikipedia.org/wiki/Vector_space_model
How It Can Work?
#51
52. o A simple model: Nodes are described by the outgoing edges
I.e. PO-pairs <predicate,object> (key and value, relation type and target)
Concern: there are too many different PO-pairs in a big KG
o For FactForge, it would be a space with 100M+ dimensions …
o Assumption: two nodes are more similar if they share more PO-pairs
o I.e. they are connected to the same other nodes with the same types of relationships
Weighted sum: weight PO-pairs based on their “information value”
More “popular” pairs are less informative; known as TF.IDF
o Experiment with node similarity in FactForge.net
Vector Space Model for Knowledge Graphs
#52
53. Example: Nodes Sharing Outgoing Links
id:IOEW2389W id:IOP235JKO
San Francisco
Aerospace
“BFR LLC”name
“Venus Express Inc.”
name
Beta Inc.
parent
shared edge
differentiating edge
“Venus Express Inc.”
#53
54. Demonstration
1. Shared edges for a node
2. Shared edges for a node, filtered by predicate
3. Similar nodes by count of shared edges
4. Similar nodes by count of shared edges, typed
5. Shared pairs for a node, by popularity
6. Node similarity as weighted sum of shared pairs
NOTE: Saved queries at FactForge.net (named SH*)
#54
56. o Experiment with better TF.IDF normalizations
o Take into consideration similarities between Ps and Os
<locatedIn,Austin> and <hasOfficeIn,Texas> are not unrelated
o Combine with other context info: co-occurrence, RDFRank, …
o Optimize the computation of such similarity
Next Steps on Similarity in KG
57. Thank you!
Experience the technology with our demonstrators
NOW: Semantic News Portal http://now.ontotext.com
RANK: News popularity ranking for companies http://rank.ontotext.com
FactForge: Hub for open data and news about People and Organizations
http://factforge.net
#57