[Webinar] Graph Analytics on Company Data and News

Cognitive Graph Analytics
on Company Data and News
Webinar, Mar 2018
Popularity ranking, importance and similarity

Presentation Outline
o Ontotext Introduction
o Cognitive Analytics Meet Big Knowledge Graphs
o Big Company Data: Knowing, Matching and Cleaning
o FactForge as a Playground
o RDFRank: Importance of Node in a Knowledge Graph
o Popularity Ranking of Companies
o Node Similarity in Knowledge Graphs
Presentation Outline

Technology Excellence Delivered
o Unique technology mix: GraphDBTM engine + Text mining
o Robust technology: We power BBC.CO.UK/SPORT and FT.COM
o We serve many of the most knowledge intensive enterprises:
FT
Financial
Times
Houses of Parliament
Information & Communications Technology

Linking Text to Knowledge Graphs
1. Integrate relevant data from many sources
 Build a Big Knowledge Graph from proprietary databases and
taxonomies, integrated with millions of facts of Linked Data
2. Infer new facts and unveil relationships
 Performing reasoning across data from different sources
3. Link text reference to a big Knowledge Graphs
 Using text-mining to automatically discover references to concepts and entities
4. Use GraphDB for metadata management, querying and search

Cognitive Systems: Requirements
Enterprise cognitive systems should:
1. Know the key concepts and entities in the domain
2. Be able to recognize in text mentions of entities
3. Be able to reconciliate entities across data sources and systems
4. Combine global intelligence with proprietary data and perspective
5. Evolve and learn, without cognitive expert engineering

Cognitive Systems: Motivation
Adequate business decisions require global information!
 Analytics cannot deliver deep insights only based on proprietary data
 Broader context and signals are needed

Cognitive Systems: Motivation
Developing from scratch cognitive
system with global knowledge is
unfeasible for most of the enterprises
 We don’t grow and educate people tailor-made to
serve single enterprise, right?

Cognitive Reading Machines
o Knowledge and awareness
 Factual information about concepts, entities, events , …
o Reading capabilities:
 Lexical and grammatical knowledge to comprehend text
 Recognize concepts, entities and events
o Resolving ambiguity: whether “Paris” refers to "Paris, France“, “Paris, TX” or "Paris Hilton"
o Learning capabilities: learn new facts from text
 Read and analyze information in order to extend their knowledge
o Get better entity profiles and relations
 Get better in recognizing concepts and entities

Context and Awareness
o Context allows concepts to be identified, the way people do
o Sufficiently big knowledge graph provides context for the entity
 Differentiating features and similar nodes
 How important and how popular it is
 Related entities and concepts
 Entities it is typically mentioned together with (co-occurrence)
o This is awareness!
o The kind of knowledge that people mean saying "I am aware of X"
or "She is cognizant of Y"

The Critical Mass
Malcolm Gladwell claims that one needs
to devote 10 000 hours to become an
expert in something, e.g. violin or hokey
(Outliers)

The Critical Mass
o A cognitive system needs:
 To know 1B facts
 About 100M concepts and entities
 Read 1M news articles
o In order to reach concept and entity
awareness in a specific domain
 The level of awareness that people mean saying
“My background is X”

Let’s play an Awareness game!
o Important airports near London?
o The most popular banks in US?
o Companies similar to Google?
o People mentioned together with IBM in news?

We are getting closer!
o Our Business Knowledge Model can already answer many of
these questions better than you
o Most of this intelligence is available in the Ontotext Platform
o Knowledge model = KG + text mining + analytics
o We already offer two such knowledge models:
 Business and general news: one for processing general business master data (like people,
organizations, locations and their mentions in the news)
 Life sciences and healthcare

o POL data is the most common type of master/reference data
 Considering business applications and news
o Open POL data is available in vast quantities
 Geonames covers locations exhaustively; DBPedia covers well popular POL entities; Wikidata, …
 Open company data grows: OpenCorporates, GLEI, open national registers, various “data leaks”
o Within 3 years exhaustive global POL data will be commodity!
 And they will be widely used for BI and decision making
o Ontotext delivers Global POL data solutions today.
 We aim to make them more affordable with more cognitive analytics
Person, Organization, Location (POL) Data

Oct
2016
Company Data Species (1/2)
Category Representatives Size (Orgs.)
Exhaustive Global Databases Dun & Bradstreet, BvD, Factset > 200M
Rich Company Databases Capital IQ (S&P), Thomson Reuters (various) 5-10M
Investment Databases CrunchBase, PitchBook, CBI, DJ Venture Source 200-600K
Very Big Open Databases OpenCorporates 130M
Global Official Open Databases GLEI (Global Legal Identifier), EU BRIS 1-30M
Open Encyclopedic DBPedia, Wikidata 0.3-1.2M
Open Leaks and Investigations Panama Papers (Offshore Leaks), Trump World Data 3-300K

Oct
2016
Company Data Species (2/2)
Category Loca
tions
Industry
Classi-
fication
High
Tech.
Fields
Invest.
Info
Org-Org
Relations
(e.g.Tree)
Org-
Person
Relations
Clean,
Correct,
Predictable
Exhaustive Global Databases ++ +/- - - ++ +/- 6
Rich Company Databases ++ + +/- +/- ++ +/- 8
Investment Databases +/- +/- + + ++/- +/- 4-6
Very Big Open Databases + +/- - - +/- - 8
Global Official Open Databases + - - - +/- - 8
Open Encyclopedic +/- +/- + - +/- + 3-5
Open Leaks and Investigations +/- - - - +/- + 4-6

o Integrate European company and economic data
o euBusinessGraph will overcome barriers in company data provisioning
o Technology and research partners: SINTEF (coord.), Ontotext, IJS, Uni. Milano

Matching and Overlap
o Organizations matched across:
CrunchBase (CB), CB Insights
(CBI), Capital IQ (CIQ), DJ
Venture Source, …
o The Venn diagram presents
the overlap between sources
 The size of the circle indicates
number of entities per source
 The level of overlap indicates number
of entities matched between the two
sources

Oct
2016
Slice and Dice
by Data Source,
Industry and
High-tech Field

Oct
2016
Data Consolidation Across Data Sources (1/2)

Oct
2016
Data Consolidation Across Data Sources (2/2)

Entity Matching Across Datasets
o Match IDs of one of the same real entity across different databases
o Data Challenges
 Different schemata
 Name variations
 Different classifications and codes
 Lack of unique identifiers (even ticker symbols are not unique)
o Technology challenges
 Pre-selection is needed; brute-force matching is not good for 1M against 5M companies
 It is not trivial to come up with good pre-selection mechanism

Company Matching Methodology
o Load all datasets in a single knowledge graph
 Schema mapping: RDF-ize all datasets using an unified ontology
o Map the industry classifications used across the sources
o Map all locations to Geonames
o Pre-selection: use FTS on names and descriptions
 GraphDB’s connectors for Lucene or ElasticSearch work perfect for this
o Develop Golden Standard (positive and negative examples)
o Calculate matching scores based on structured criteria

Company Matching Sample Project
o We can matched 5+ big datasets within couple of months
o Fully automated procedure, which takes few hours to execute
 90% SPARQL and GraphDB’s FTS connectors
o One can get to about 85% F-Score with simple structural matching
rules with weights
o To get higher accuracy, you need:
 Massive amount of manual work and fine-tuning of weights … or
 Cognitive analytics (importance, similarity, highly accurate named entity recognition, etc.)

FactForge: Data Integration
o DBpedia (the English version) 496M
o Geonames (all geographic features on Earth) 150M
o owl:sameAs links between DBpedia and Geonames 471K
o GLEI (global company register data) 3M
o Panama Papers DB (#LinkedLeaks) 20M
o Other datasets and ontologies: WordNet, WorldFacts, FIBO
o News metadata (2000 articles/day enriched by NOW) 673M
o Total size (1.8B explicit + 328M inferred statements) 2 168М

News Metadata
o Metadata from Ontotext’s Dynamic Semantic Publishing platform
 News stream from Google
 Automatically generated as part of the NOW.ontotext.com semantic news showcase
o News stream from Google since Feb 2015, about 50k news/month
 Over 1M news articles at present
 ~70 tags (annotations) per news article
o Tags link text mentions of concepts to the knowledge graph
 Technically these are URIs for entities (people, organizations, locations, etc.) and key phrases

GraphDB Workbench: Class Instances & Hierarchy
#31

GraphDB Workbench: Class Relations
#32

Explore Your Data Graph
Navigate to Explore -> Visual graph. You see a search input to choose a
resource as a starting point for graph exploration.
#33

#34
Visualize the Graph of Sofia

#35
Visualize the Graph of Sofia

Visual Graph: Node details
#36

SPARQL Editing
o GDB WB integrates the YASGUI editor
o Automatic prefix addition (best practice: load prefixes.ttl)
o Autocompletion of URIs (classes, properties, entities…)
#37

FactForge Saved Queries
FactForge
query F04
#38

o The RDFRank plug-in of GraphDB™ computes PageRank importance
 It identifies “important” nodes in an RDF graph based on their interconnectedness
o Ranks accessible via rank:hasRDFRank predicate
o Incremental RDF Rank calculation is useful for dynamic data
o Example: to select the top 100 important nodes in the RDF graph:
RDFRank: Importance of Node in a Graph
PREFIX rank: http://www.ontotext.com/owlim/RDFRank#
SELECT * { ?n rank:hasRDFRank ?r }
ORDER BY DESC(?r)
LIMIT 100

RDFRank Plug-in Configuration and Usage

Semantic Media Monitoring
For each entity:
opopularity
trends
orelevant news
orelated
entities
oknowledge
graph
information
43

Semantic Media Monitoring/Press-Clipping
o We can trace references to specific company in the news
 This is pretty much standard, however we can deal with syntactic variations in the names,
because state of the art Named Entity Recognition technology is used
 What’s more important, we distinguish correctly in which mention “Paris” refers to which
of the following: Paris (the capital of France), Paris in Texas, Paris Hilton or to Paris (the
Greek hero)
o We can trace and consolidate references to daughter companies
o We have comprehensive industry classification
#44

Media Monitoring Queries
o F6: News that mention a specific entity during specific period
o F7: Most popular companies per industry
o F8: Most popular companies per industry, including children
#45

News Popularity Ranking of Companies
o Can be customized by industry, region and time period
o Unique features:
 It is based on live streaming news
 Tracks also mentions of subsidiaries
o Rank uses the industry sectors of DBPedia with several refinements
 About 40 top-industry sectors
 Sectors are linked in a hierarchical taxonomy (all together 251 sectors)
 Industry sectors are de-duplicated (all designators used in Wikipedia are about 9 000)
o Try rank.ontotext.com
#46

rank.ontotext.com Demonstrator
#47

#48

Why?
o I want to search in knowledge graphs!
 Because they are too big and I don’t have years to write “proper” SPARQL
o I want to detect duplicates
o I want to reify entities across sources
 For disambiguation of entity references in text
 For reconciliation or for linking or data integration

o Adapt Information Retrieval (IR) ideas
o IR’s basic is the Vector space model
 Documents are vectors in a geometrical space that has one dimension for each word
 Coordinates correspond to the number of times that the word appears in the document
 Imagine there are only 5 words used in all documents: a, ab, acb, ba, bc
 Document content: “a bc ba a abc”
 Document vector = <2,0,1,1,1>
 Similarity is measure by the COS of the angle between the vectors
 Very popular and simple. Read more at https://en.wikipedia.org/wiki/Vector_space_model
How It Can Work?
#51

o A simple model: Nodes are described by the outgoing edges
 I.e. PO-pairs <predicate,object> (key and value, relation type and target)
 Concern: there are too many different PO-pairs in a big KG
o For FactForge, it would be a space with 100M+ dimensions …
o Assumption: two nodes are more similar if they share more PO-pairs
o I.e. they are connected to the same other nodes with the same types of relationships
 Weighted sum: weight PO-pairs based on their “information value”
 More “popular” pairs are less informative; known as TF.IDF
o Experiment with node similarity in FactForge.net
Vector Space Model for Knowledge Graphs
#52

Example: Nodes Sharing Outgoing Links
id:IOEW2389W id:IOP235JKO
San Francisco
Aerospace
“BFR LLC”name
“Venus Express Inc.”
name
Beta Inc.
parent
shared edge
differentiating edge
“Venus Express Inc.”
#53

Demonstration
1. Shared edges for a node
2. Shared edges for a node, filtered by predicate
3. Similar nodes by count of shared edges
4. Similar nodes by count of shared edges, typed
5. Shared pairs for a node, by popularity
6. Node similarity as weighted sum of shared pairs
NOTE: Saved queries at FactForge.net (named SH*)
#54

o Experiment with better TF.IDF normalizations
o Take into consideration similarities between Ps and Os
 <locatedIn,Austin> and <hasOfficeIn,Texas> are not unrelated
o Combine with other context info: co-occurrence, RDFRank, …
o Optimize the computation of such similarity
Next Steps on Similarity in KG

Thank you!
Experience the technology with our demonstrators
NOW: Semantic News Portal http://now.ontotext.com
RANK: News popularity ranking for companies http://rank.ontotext.com
FactForge: Hub for open data and news about People and Organizations
http://factforge.net
#57

[Webinar] Graph Analytics on Company Data and News

Recommended

Recommended

More Related Content

More from Ontotext

More from Ontotext (20)

Recently uploaded

Recently uploaded (20)

[Webinar] Graph Analytics on Company Data and News