Cognitive Graph Analytics on Company Data and News

Cognitive Graph Analytics
on Company Data and News
Texas Data Day, Jan 2018
Popularity ranking, importance and similarity

Presentation Outline
o Ontotext Introduction
o Cognitive Analytics Meet Big Knowledge Graphs
o Big Company Data: Knowing, Matching and Cleaning
o FactForge as a Playground
o RDFRank: Importance of Node in a Knowledge Graph
o Popularity Ranking of Companies
o Node Similarity in Knowledge Graphs
Presentation Outline

History and Key Facts
o Started in year 2000 as Semantic Web pioneer
ü As R&D lab within Sirma; still part of Sirma Group – 400 persons, public company
ü Got spun-off and took VC investment in 2008
o 65 staff; R&D center in Bulgaria
ü Over 400 person-years invested in R&D
ü Multiple innovation & technology awards: Washington Post, BBC, FT, BAIT, etc.
o 80% sales in USA and UK
o Member of multiple industry bodies
ü W3C, EDMC, ODI, LDBC, STI, DBPedia Association

Technology Excellence Delivered
o Unique technology mix: GraphDBTM engine + Text mining
o Robust technology: We power BBC.CO.UK/SPORT and FT.COM
o We serve many of the most knowledge intensive enterprises:

GraphDB: Relation Discovery Case
o Find suspicious
relationships like:
ü Company in USA
ü Controls another
company in USA
ü Through a company in
an off-shore zone
o Show news
relevant to these
companies
#5

Linking Text to Knowledge Graphs
1. Integrate relevant data from many sources
ü Build a Big Knowledge Graph from proprietary databases and
taxonomies, integrated with millions of facts of Linked Data
2. Infer new facts and unveil relationships
ü Performing reasoning across data from different sources
3. Link text reference to a big Knowledge Graphs
ü Using text-mining to automatically discover references to concepts and entities
4. Use GraphDB for metadata management, querying and search

Cognitive Systems: Motivation
Adequate business decisions require global information!
ü Analytics cannot deliver deep insights only based on proprietary data
ü Broader context and signals are needed

Cognitive Systems: Motivation
Developing from scratch cognitive
system with global knowledge is
unfeasible for most of the enterprises
ü We don’t grow and educate people tailor-made to
serve single enterprise, right?

Cognitive Systems: Requirements
Enterprise cognitive systems should:
1. Know the key concepts and entities in the domain
2. Be able to recognize in text mentions of entities
3. Be able to reconciliate entities across data sources and systems
4. Combine global intelligence with proprietary data and perspective
5. Evolve and learn, without cognitive expert engineering

Cognitive Reading Machines
o Knowledge and awareness
ü Factual information about concepts, entities, events , …
o Reading capabilities:
ü Lexical and grammatical knowledge to comprehend text
ü Recognize concepts, entities and events
o Resolving ambiguity: whether “Paris” refers to "Paris, France“, “Paris, TX” or "Paris Hilton"
o Learning capabilities: learn new facts from text
ü Read and analyze information in order to extend their knowledge
o Get better entity profiles and relations
ü Get better in recognizing concepts and entities

Context and Awareness
o Context allows concepts to be identified, the way people do
o Sufficiently big knowledge graph provides context for the entity
ü Differentiating features and similar nodes
ü How important and how popular it is
ü Related entities and concepts
ü Entities it is typically mentioned together with (co-occurrence)
o This is awareness!
o The kind of knowledge that people mean saying "I am aware of X"
or "She is cognizant of Y"

The Critical Mass
Malcolm Gladwell claims that one needs
to devote 10 000 hours to become an
expert in something, e.g. violin or hokey
(Outliers)

The Critical Mass
o A cognitive system needs:
ü To know 1B facts
ü About 100M concepts and entities
ü Read 1M news articles
o In order to reach concept and entity
awareness in a specific domain
ü The level of awareness that people mean saying
“My background is X”

Commercial Company
Database
(e.g. D&B)
Social Media
News
Wikipedia
Privateo Knowledge Graphs
provide semantic
entity fingerprints
ü Each node contributes to the
description of the others
ü Network topology suggest
importance, similarity, etc.
o Such Graphs should
combine diverse
information
Knowledge Graphs
As Context

Let’s play an Awareness game:
o The important airports near San Francisco?
o The most popular banks in US?
o Companies similar to Google?
o People most often appearing with IBM in news?

We are getting closer!
o Our Business Knowledge Model can already answer many of
these questions better than you
o Most of this intelligence is available in the Ontotext Platform
o Knowledge model = KG + text mining + analytics
o We already offer two such knowledge models:
ü Business and general news: one for processing general business master data (like people,
organizations, locations and their mentions in the news)
ü Life sciences and healthcare

Oct
2016
o POL data is the most common type of master/reference data
ü Considering business applications and news
o Open POL data available in vast quantities
ü Geonames covers locations exhaustively; DBPedia covers well popular POL entities; Wikidata, …
ü Open company data grows: OpenCorporates, GLEI, open national registers, various “data leaks”
ü Within 3 years exhaustive global POL data will be commodity!
o Ontotext delivers Global POL data solutions today
ü Integrating open and commercial POL data – a billion triples demo at http://factforge.net
ü Looking up and disambiguating references to these entities in the text, http://now.ontotext.com
ü Rankings and trends for each of these entities by news popularity – demo http://rank.ontotext.com
Person, Organization, Location (POL) Data

Mapping Datasets to DBPedia
o The task: Map people, organizations and locations to IDs in DBPedia
ü So that we can analyze the original data with the help of the extra information available in
DBPedia and other datasets that are related to it, e.g. Geonames
ü For instance, the data from GLEI doesn’t contain any extra information about the companies,
e.g. industry sector, products, etc.
o Specific conditions: map by names and locations
ü There аre little attributes and features common for both GLEI and DBPedia
ü We mapped locations only in terms of countries and/or cities not finer grained locations

o GraphDB’s Lucene Connector used for preselection
ü GraphDB Lucene connector index is created for Organizations and People from DBPedia,
indexing all sorts of names, descriptions and other textual information for each entity
ü The name of the entity from the 3rd party dataset (e.g. Panama Papers or GLEI) is used as a
FTS query, embedded in a SPARQL query
ü Next filter by location, industry and other criteria (e.g. RDFRank, similarity, … )
o What is that Lucence does better than SPARQL?
ü When there is little information other than the name, Lucene deals well with minor syntactic
variations and sorts the results by relevance
ü When mapping 300 000 organizations against 500 000 organizations, without a key, the
complexity of a SPARQL query is 300 000 x 500 000, way slower that 300 000 Lucene queries
Mapping datasets to DBPedia (2)

Mapping GLEI to DBPedia
o Data Pre-processing in DBPedia
ü Generate normalized names (lower case, suffix removal, etc.)
ü Generated primary city and primary country for each organization in DBPedia
o Also cleaned up data about HQ locations, etc. (via SPARQL queries)
o Iterative matching
ü Match first those that have high relevance and match better constraints by location and country
o Matching outcome
ü skos:exactMatch: 3880 matches
ü skos:closeMatch: 5825 matches

Oct
2016
Company Data Species (1/2)
Category Representatives Size (Orgs.)
Exhaustive Global Databases Dun & Bradstreet, BvD, Factset > 200M
Rich Company Databases Capital IQ (S&P), Thomson Reuters (various) 5-10M
Investment Databases CrunchBase, PitchBook, CBI, DJ Venture Source 200-600K
Very Big Open Databases OpenCorporates 130M
Global Official Open Databases GLEI (Global Legal Identifier) 1M
Open Encyclopedic DBPeida, Wikidata 300-500K
Open Leaks and Investigations Panama Papers (Offshore Leaks), Trump World Data 3-300K

Oct
2016
Company Data Species (2/2)
Category Loca
tions
Major
Industry
Classif.
Tech.
Fields
Invest.
Info
Org-Org
Relations
(e.g.Tree)
Org-
Person
Relations
Clean,
Correct,
Predictable
Exhaustive Global Databases ++ +/- - - +/- +/- 5
Rich Company Databases ++ + +/- +/- ++ +/- 8
Investment Databases +/- +/- +/- + +/- +/- 4-5
Very Big Open Databases + - - - +/- - 3
Global Official Open Databases + - - - +/- - 5
Open Encyclopedic +/- +/- + - +/- + 4-6
Open Leaks and Investigations +/- - - - +/- + 3-5

Oct
2016
o Similar methodology as for matching open data to DBPedia
o Intensive usage of normalized locations and industry sectors
ü Locations across all datasets mapped to Geonames
ü Matching industry classification across the sources (e.g. to GICS)
o Use of identity-suggesting attributes (e.g. website and ticker)
ü Sounds like low hanging fruit … but non of those is trouble-free
o When matching 5+ datasets, precision becomes critical
ü A cluster of identifiers of one and the same company identifiers can contain 5
ü If one of them is included by mistake, the whole cluster is compromised
Matching Commercial Company Datasets

Oct
2016
Matching and Overlap
o Organizations matched across:
CrunchBase (CB), CB Insights
(CBI), Capital IQ (CIQ), DJ
Venture Source, …
o The Venn diagram presents
the overlap between sources
ü The size of the circle indicates
number of entities per source
ü The level of overlap indicates number
of entities matched between the two
sources

Oct
2016
Slice and Dice
by Data Source,
Industry and
High-tech Field

Oct
2016
Data Consolidation Across Data Sources (1/2)

Oct
2016
Data Consolidation Across Data Sources (2/2)

o Integrate European company and economic data
o euBusinessGraph will overcome barriers in company data provisioning
o Technology and research partners: SINTEF (coord.), Ontotext, IJS, Uni. Milano

FactForge: Data Integration
o DBpedia (the English version) 496M
o Geonames (all geographic features on Earth) 150M
o owl:sameAs links between DBpedia and Geonames 471K
o GLEI (global company register data) 3M
o Panama Papers DB (#LinkedLeaks) 20M
o Other datasets and ontologies: WordNet, WorldFacts, FIBO
o News metadata (2000 articles/day enriched by NOW) 673M
o Total size (1.8B explicit + 328M inferred statements) 2 168М

News Metadata
o Metadata from Ontotext’s Dynamic Semantic Publishing platform
ü News stream from Google
ü Automatically generated as part of the NOW.ontotext.com semantic news showcase
o News stream from Google since Feb 2015, about 50k news/month
ü Over 1M news articles at present
ü ~70 tags (annotations) per news article
o Tags link text mentions of concepts to the knowledge graph
ü Technically these are URIs for entities (people, organizations, locations, etc.) and key phrases

GraphDB Workbench : Class Instances & Hierarchy
#35

GraphDB Workbench: Class Relations
#36

Explore Your Data Graph
Navigate to Explore -> Visual graph. You see a search input to choose a
resource as a starting point for graph exploration.
#37

#38
Visualize the Graph of Sofia

#39
Visualize the Graph of Sofia

Visual Graph: Node details
#40

SPARQL Editing
o GDB WB integrates the YASGUI editor
o Automatic prefix addition (best practice: load prefixes.ttl)
o Autocompletion of URIs (classes, properties, entities…)
#41

FactForge Saved Queries
FactForge
query F04
#42

o The RDFRank plug-in of GraphDB™ calculates PageRank importance
ü It identifies “important” nodes in an RDF graph based on their interconnectedness
o The values can be accessed using the rank:hasRDFRank predicate
o Incremental RDF Rank calculation is useful for dynamic data
o For example, to select the top 100 important nodes in the RDF graph:
RDFRank: Importance of Node in a Graph
PREFIX rank: http://www.ontotext.com/owlim/RDFRank#
SELECT * { ?n rank:hasRDFRank ?r }
ORDER BY DESC(?r)
LIMIT 100

RDFRank Plug-in Configuration and Usage

Semantic Media Monitoring
For each entity:
opopularity
trends
orelevant news
orelated
entities
oknowledge
graph
information
47

Semantic Media Monitoring/Press-Clipping
o We can trace references to a specific company in the news
ü This is pretty much standard, however we can deal with syntactic variations in the names,
because state of the art Named Entity Recognition technology is used
ü What’s more important, we distinguish correctly in which mention “Paris” refers to which
of the following: Paris (the capital of France), Paris in Texas, Paris Hilton or to Paris (the
Greek hero)
o We can trace and consolidate references to daughter companies
o We have comprehensive industry classification
#48

Media Monitoring Queries
o F6: News that mention a specific entity during specific period
o F7: Most popular companies per industry
o F8: Most popular companies per industry, including children
#49

News Popularity Ranking of Companies
o Can be customized by industry, region and time period
o Unique features:
ü It is based on live streaming news
ü Tracks also mentions of subsidiaries
o Rank uses the industry sectors of DBPedia with several refinements
ü About 40 top-industry sectors
ü Sectors are linked in a hierarchical taxonomy (all together 251 sectors)
ü Industry sectors are de-duplicated (all designators used in Wikipedia are about 9 000)
o Try rank.ontotext.com
#50

rank.ontotext.com Demonstrator
#51

#52

Why?
o I want to search in knowledge graphs!
ü Because they are too big and I don’t have years to write “proper” SPARQL
o I want to detect duplicates
o I want to reify entities across sources
ü For disambiguation of entity references in text
ü For reconciliation or for linking or data integration

o Adapt Information Retrieval (IR) ideas
o IR’s basic is the Vector space model
ü Documents are vectors in a geometrical space that has one dimension for each word
ü Coordinates correspond to the number of times that the word appears in the document
ü Imagine there are only 5 words used in all documents: a, ab, acb, ba, bc
ü Document content: “a bc ba a abc”
ü Document vector = <2,0,1,1,1>
ü Similarity is measure by the COS of the angle between the vectors
ü Very popular and simple. Read more at https://en.wikipedia.org/wiki/Vector_space_model
How It Can Work?
#55

o A simple model: Nodes are described by the outgoing edges
ü I.e. PO-pairs <predicate,object> (key and value, relation type and target)
ü Concern: there are too many different PO-pairs in a big KG
o For FactForge, it would be a space with 100M+ dimensions …
o Assumption: two nodes are more similar if they share more PO-pairs
o I.e. they are connected to the same other nodes with the same types of relationships
ü Weighted sum: weight PO-pairs based on their “information value”
ü More “popular” pairs are less informative; known as TF.IDF
o Experiment with node similarity in FactForge.net
Vector Space Model for Knowledge Graphs
#56

Example: Nodes Sharing Outgoing Links
id:IOEW2389W id:IOP235JKO
San Francisco
Aerospace
“BFR LLC”name
“Venus Express Inc.”
name
Beta Inc.
parent
shared edge
differentiating edge
“Venus Express Inc.”
#57

Demonstration
1. Shared edges for a node
2. Shared edges for a node, filtered by predicate
3. Similar nodes by count of shared edges
4. Similar nodes by count of shared edges, typed
5. Shared pairs for a node, by popularity
6. Node similarity as weighted sum of shared pairs
NOTE: Saved queries at FactForge.net (named SH*)
#58

o Experiment with better TF.IDF normalizations
o Take into consideration similarities between Ps and Os
ü <locatedIn,Austin> and <hasOfficeIn,Texas> are not unrelated
o Combine with other context info: co-occurrence, RDFRank, …
o Optimize the computation of such similarity
Next Steps on Similarity in KG

Thank you!
Experience the technology with our demonstrators
NOW: Semantic News Portal http://now.ontotext.com
RANK: News popularity ranking for companies http://rank.ontotext.com
FactForge: Hub for open data and news about People and Organizations
http://factforge.net
#61

Cognitive Graph Analytics on Company Data and News

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Cognitive Graph Analytics on Company Data and News

Similar to Cognitive Graph Analytics on Company Data and News (20)

More from Ontotext

More from Ontotext (16)

Recently uploaded

Recently uploaded (20)

Cognitive Graph Analytics on Company Data and News