Ontotext introduced their cognitive analytics platform that performs cognitive graph analytics on company data and news. The platform builds large knowledge graphs by integrating data from multiple sources and uses text mining to link news articles to entities in the knowledge graph. It provides functionality for node ranking, similarity analysis and data cleaning to consolidate and reconcile company records across datasets. The platform was demonstrated through a knowledge graph containing over 2 billion facts built by integrating datasets like DBpedia, Geonames, and news article metadata.
3. History and Key Facts
o Started in year 2000 as Semantic Web pioneer
ü As R&D lab within Sirma; still part of Sirma Group – 400 persons, public company
ü Got spun-off and took VC investment in 2008
o 65 staff; R&D center in Bulgaria
ü Over 400 person-years invested in R&D
ü Multiple innovation & technology awards: Washington Post, BBC, FT, BAIT, etc.
o 80% sales in USA and UK
o Member of multiple industry bodies
ü W3C, EDMC, ODI, LDBC, STI, DBPedia Association
12. Cognitive Reading Machines
o Knowledge and awareness
ü Factual information about concepts, entities, events , …
o Reading capabilities:
ü Lexical and grammatical knowledge to comprehend text
ü Recognize concepts, entities and events
o Resolving ambiguity: whether “Paris” refers to "Paris, France“, “Paris, TX” or "Paris Hilton"
o Learning capabilities: learn new facts from text
ü Read and analyze information in order to extend their knowledge
o Get better entity profiles and relations
ü Get better in recognizing concepts and entities
16. Commercial Company
Database
(e.g. D&B)
Social Media
News
Wikipedia
Privateo Knowledge Graphs
provide semantic
entity fingerprints
ü Each node contributes to the
description of the others
ü Network topology suggest
importance, similarity, etc.
o Such Graphs should
combine diverse
information
Knowledge Graphs
As Context
20. Oct
2016
o POL data is the most common type of master/reference data
ü Considering business applications and news
o Open POL data available in vast quantities
ü Geonames covers locations exhaustively; DBPedia covers well popular POL entities; Wikidata, …
ü Open company data grows: OpenCorporates, GLEI, open national registers, various “data leaks”
ü Within 3 years exhaustive global POL data will be commodity!
o Ontotext delivers Global POL data solutions today
ü Integrating open and commercial POL data – a billion triples demo at http://factforge.net
ü Looking up and disambiguating references to these entities in the text, http://now.ontotext.com
ü Rankings and trends for each of these entities by news popularity – demo http://rank.ontotext.com
Person, Organization, Location (POL) Data
22. o GraphDB’s Lucene Connector used for preselection
ü GraphDB Lucene connector index is created for Organizations and People from DBPedia,
indexing all sorts of names, descriptions and other textual information for each entity
ü The name of the entity from the 3rd party dataset (e.g. Panama Papers or GLEI) is used as a
FTS query, embedded in a SPARQL query
ü Next filter by location, industry and other criteria (e.g. RDFRank, similarity, … )
o What is that Lucence does better than SPARQL?
ü When there is little information other than the name, Lucene deals well with minor syntactic
variations and sorts the results by relevance
ü When mapping 300 000 organizations against 500 000 organizations, without a key, the
complexity of a SPARQL query is 300 000 x 500 000, way slower that 300 000 Lucene queries
Mapping datasets to DBPedia (2)
24. Oct
2016
Company Data Species (1/2)
Category Representatives Size (Orgs.)
Exhaustive Global Databases Dun & Bradstreet, BvD, Factset > 200M
Rich Company Databases Capital IQ (S&P), Thomson Reuters (various) 5-10M
Investment Databases CrunchBase, PitchBook, CBI, DJ Venture Source 200-600K
Very Big Open Databases OpenCorporates 130M
Global Official Open Databases GLEI (Global Legal Identifier) 1M
Open Encyclopedic DBPeida, Wikidata 300-500K
Open Leaks and Investigations Panama Papers (Offshore Leaks), Trump World Data 3-300K
26. Oct
2016
o Similar methodology as for matching open data to DBPedia
o Intensive usage of normalized locations and industry sectors
ü Locations across all datasets mapped to Geonames
ü Matching industry classification across the sources (e.g. to GICS)
o Use of identity-suggesting attributes (e.g. website and ticker)
ü Sounds like low hanging fruit … but non of those is trouble-free
o When matching 5+ datasets, precision becomes critical
ü A cluster of identifiers of one and the same company identifiers can contain 5
ü If one of them is included by mistake, the whole cluster is compromised
Matching Commercial Company Datasets
33. FactForge: Data Integration
o DBpedia (the English version) 496M
o Geonames (all geographic features on Earth) 150M
o owl:sameAs links between DBpedia and Geonames 471K
o GLEI (global company register data) 3M
o Panama Papers DB (#LinkedLeaks) 20M
o Other datasets and ontologies: WordNet, WorldFacts, FIBO
o News metadata (2000 articles/day enriched by NOW) 673M
o Total size (1.8B explicit + 328M inferred statements) 2 168М
34. News Metadata
o Metadata from Ontotext’s Dynamic Semantic Publishing platform
ü News stream from Google
ü Automatically generated as part of the NOW.ontotext.com semantic news showcase
o News stream from Google since Feb 2015, about 50k news/month
ü Over 1M news articles at present
ü ~70 tags (annotations) per news article
o Tags link text mentions of concepts to the knowledge graph
ü Technically these are URIs for entities (people, organizations, locations, etc.) and key phrases
44. o The RDFRank plug-in of GraphDB™ calculates PageRank importance
ü It identifies “important” nodes in an RDF graph based on their interconnectedness
o The values can be accessed using the rank:hasRDFRank predicate
o Incremental RDF Rank calculation is useful for dynamic data
o For example, to select the top 100 important nodes in the RDF graph:
RDFRank: Importance of Node in a Graph
PREFIX rank: http://www.ontotext.com/owlim/RDFRank#
SELECT * { ?n rank:hasRDFRank ?r }
ORDER BY DESC(?r)
LIMIT 100
50. News Popularity Ranking of Companies
o Can be customized by industry, region and time period
o Unique features:
ü It is based on live streaming news
ü Tracks also mentions of subsidiaries
o Rank uses the industry sectors of DBPedia with several refinements
ü About 40 top-industry sectors
ü Sectors are linked in a hierarchical taxonomy (all together 251 sectors)
ü Industry sectors are de-duplicated (all designators used in Wikipedia are about 9 000)
o Try rank.ontotext.com
#50
55. o Adapt Information Retrieval (IR) ideas
o IR’s basic is the Vector space model
ü Documents are vectors in a geometrical space that has one dimension for each word
ü Coordinates correspond to the number of times that the word appears in the document
ü Imagine there are only 5 words used in all documents: a, ab, acb, ba, bc
ü Document content: “a bc ba a abc”
ü Document vector = <2,0,1,1,1>
ü Similarity is measure by the COS of the angle between the vectors
ü Very popular and simple. Read more at https://en.wikipedia.org/wiki/Vector_space_model
How It Can Work?
#55
56. o A simple model: Nodes are described by the outgoing edges
ü I.e. PO-pairs <predicate,object> (key and value, relation type and target)
ü Concern: there are too many different PO-pairs in a big KG
o For FactForge, it would be a space with 100M+ dimensions …
o Assumption: two nodes are more similar if they share more PO-pairs
o I.e. they are connected to the same other nodes with the same types of relationships
ü Weighted sum: weight PO-pairs based on their “information value”
ü More “popular” pairs are less informative; known as TF.IDF
o Experiment with node similarity in FactForge.net
Vector Space Model for Knowledge Graphs
#56