This presentation describes three contributions of my PhD work:
1. Distributional Semantics for Entity Relatedness (DiSER)
2. Wikipedia Features for Entity Recommendations (WiFER)
3. Non-Orthogonal Explicit Semantic Analysis (NESA) for Word Relatedness
Further, it presents some of our work in collaboration with IBM Watson and Yahoo Research.
3. Motivation: Entity Recommendation
Example recommendations for the query entity "Semantic Web":
Technologies:
1. RDF
2. SPARQL
3. Ontology
4. Linked data
5. Turtle (syntax)
Companies:
1. Metaweb
2. Ontoprise GmbH
3. OpenLink Software
4. Ontotext
5. Powerset (company)
Example recommendations for the query entity "Myosin":
Proteins and cells:
1. Actin
2. Muscle contraction
3. Sarcomere
4. Myofibril
5. Cytoskeleton
Biologists:
1. Hugh Huxley
2. James Spudich
3. Ronald Vale
4. Manuel Morales
5. Brunó Ferenc Straub
4. Entity Relatedness
Determine the degree of relatedness between two entities, e.g., relatedness(Brad Pitt, Tom Cruise) = ?
5. Background: Entity
Example entity types: person, location, organization; time, date, money, percent; event, movie, disease, symptom, side effect, law, license, and more.
• Many such types are covered in Wikipedia
• More than 2K classes in DBpedia
• More than 350K classes in YAGO
• Every Wikipedia article is considered to describe an entity
7. Outline
• Motivation
• Entity Relatedness
• Distributional Semantics for Entity Relatedness (DiSER)
• Evaluation
• Entity Recommendation
• Wikipedia-based Features for Entity Recommendation (WiFER)
• Evaluation
• Text Relatedness
• Non-Orthogonal Explicit Semantic Analysis (NESA)
• Evaluation
• Application and Industry Use Cases
• Conclusion
8. Thesis Overview
• Chapter IV: Distributional Semantics for Entity Relatedness (DiSER), a distributional representation of entities
• Chapter V: Wikipedia-based Features for Entity Recommendation (WiFER), feature extraction for entity recommendation
• Chapter VI: Non-Orthogonal Explicit Semantic Analysis (NESA)
9. Thesis Overview: next, Chapter IV, Distributional Semantics for Entity Relatedness (DiSER)
11. Entity Relatedness: State of the Art
• Graph-based methods
  • Path distance in the Wikipedia graph (Strube and Ponzetto, 2006)
  • Normalized Google Distance on the Wikipedia graph (Milne and Witten, 2008)
  • Personalized PageRank on the Wikipedia graph (Agirre et al., 2015)
  • Path-based measures on the DBpedia graph (Hulpus et al., 2015)
• Corpus-based methods
  • Key-phrase Overlap for Related Entities (KORE): partial overlaps between key-phrases in the corresponding Wikipedia articles (Hoffart et al., 2012)
  • Text relatedness measures: use co-occurrence information in text
12. Explicit Semantic Analysis (ESA)
Entity Relatedness: State of the Art (Distributional Semantics)
ESA uses explicit (manually defined) concepts, such as Wikipedia articles, where every article is considered to describe a single concept (Gabrilovich and Markovitch, 2007).
Word-by-document matrix: each word_i is represented by a weight vector (W_i1, W_i2, ..., W_in) over documents doc_1 ... doc_n.
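To make the ESA representation concrete, here is a minimal Python sketch under simplifying assumptions: a hypothetical three-article toy corpus stands in for Wikipedia and plain TF-IDF stands in for the weighting scheme. A word's vector is its column of weights over the articles, and relatedness is the cosine of two such vectors.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

articles = [                      # toy stand-ins for Wikipedia articles (explicit concepts)
    "steve jobs co founded apple and pixar",
    "apple released the iphone and the ipad",
    "pixar is an animation studio bought by disney",
]
vec = TfidfVectorizer()
X = vec.fit_transform(articles)   # article x term weight matrix
def esa_vector(word):             # one weight per article = the word's explicit concept vector
    return X[:, vec.vocabulary_[word]].T.toarray()
print(cosine_similarity(esa_vector("jobs"), esa_vector("pixar")))

In the actual setting the corpus is the full Wikipedia dump with millions of article dimensions; the toy corpus is only for illustration.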
13. Entity Relatedness: State of the Art (Distributional Semantics)
Implicit/latent semantic analysis transforms the sparse document space into a dense latent topic space:
• Latent Semantic Analysis (LSA) (Deerwester et al., 1990)
• Latent Dirichlet Allocation (LDA) (Blei et al., 2003)
• Neural embeddings (Word2Vec) (Mikolov et al., 2013)
Dimensionality reduction: the word-by-document matrix, word_i -> (W_i1, ..., W_in) with n ~ 1M documents, is reduced to a word-by-topic matrix, word_i -> (W_i1, ..., W_ik) over topic_1 ... topic_k with k < 1000.
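For contrast with ESA's explicit dimensions, a minimal sketch of the latent route (LSA via truncated SVD); the four-document toy corpus and k = 2 are illustrative assumptions, not the setup evaluated in the thesis.

import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["steve jobs founded apple", "apple released the iphone",
        "pixar makes animated films", "jobs was ceo of pixar"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)                           # document x term matrix (sparse)
W = TruncatedSVD(n_components=2).fit_transform(X.T)   # term x topic matrix, k = 2 latent topics
def rel(w1, w2):                                      # cosine in the dense topic space
    a, b = W[vec.vocabulary_[w1]], W[vec.vocabulary_[w2]]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(rel("jobs", "pixar"))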
14. Limitation of Text Relatedness Measures
• Compositionality
  • Most entities are multi-word expressions
  • Vector(Brad Pitt) = Vector(Brad) + Vector(Pitt)?
• Ambiguity
  • Vector of an entity with an ambiguous name, e.g., "Nice" (the French city)
15. Chapter IV: Distributional Semantics for Entity Relatedness (DiSER)
Wikipedia-based Distributional Semantics for Entity Relatedness. In: AAAI-FSS-2014
Entity-by-document matrix: each entity_i is represented by a weight vector (W_i1, W_i2, ..., W_in) over Wikipedia articles doc_1 ... doc_n.
DiSER is built from Wikipedia annotated with entities, applying the "one sense per document" heuristic so that every mention of an entity in an article is linked.
Wikipedia text with its original entity links:
[Steve Jobs] co-founded Apple in 1976 to sell Wozniak's [Apple I] [Personal Computer]. [Steve Jobs | Jobs] was CEO of [Apple Inc. | Apple] and largest shareholder of [Pixar]. Jobs is widely recognized as a pioneer of the [Microcomputer Revolution], along with [Steve Wozniak | Wozniak].
After annotating all entity mentions (one sense per document):
[Steve Jobs] co-founded [Apple Inc. | Apple] in 1976 to sell [Steve Wozniak | Wozniak]'s [Apple I] [Personal Computer]. [Steve Jobs | Jobs] was CEO of [Apple Inc. | Apple] and largest shareholder of [Pixar]. [Steve Jobs | Jobs] is widely recognized as a pioneer of the [Microcomputer Revolution], along with [Steve Wozniak | Wozniak].
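A minimal sketch of the DiSER idea under simplifying assumptions: a hypothetical four-article corpus reduced to its entity links, and raw counts in place of the TF-IDF-style weighting used in the thesis. Each entity's vector is its row of weights over Wikipedia articles; relatedness is the cosine of two rows.

import numpy as np

articles = {                                 # toy annotated Wikipedia: article -> linked entities
    "Steve Jobs": ["Steve Jobs", "Apple Inc.", "Pixar", "Steve Wozniak"],
    "Apple Inc.": ["Apple Inc.", "Steve Jobs", "Steve Wozniak", "Apple I"],
    "Pixar":      ["Pixar", "Steve Jobs"],
    "Bill Gates": ["Bill Gates", "Microsoft"],
}
entities = sorted({e for links in articles.values() for e in links})
# entity x article count matrix (one column per Wikipedia article)
M = np.array([[links.count(e) for links in articles.values()] for e in entities], dtype=float)
def diser(e1, e2):
    v1, v2 = M[entities.index(e1)], M[entities.index(e2)]
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
print(diser("Steve Jobs", "Apple Inc."))     # high: the entities co-occur as links
print(diser("Steve Jobs", "Bill Gates"))     # 0.0 in this toy corpus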
16. ESA vs DiSER Vector (Chapter IV)
Wikipedia-based Distributional Semantics for Entity Relatedness. In: AAAI-FSS-2014
Top dimensions of the DiSER vector for the entity "Brad Pitt": The Tree of Life (film), Falmouth, Cornwall, World War Z (film), What Just Happened, A Mighty Heart (film), Plan B Entertainment, Jamaican Patois, Richard: A Novel, Sobriquet, I Want a Famous Face
Top dimensions of the ESA vector for the text "Brad Pitt": Damiani (jewelry company), University of Pittsburgh Band, Brad Pitt, Make It Right Foundation, Pittsburgh men's basketball, Brangelina, Pittsburgh Panthers baseball, Pitt (Comics), Pitt River, Brad Pitt filmography
18. Entity Relatedness: Dataset
• Absolute relatedness score
  • Relatedness between "Apple Inc." and "Steve Jobs"
  • Very low inter-annotator agreement
• Relative relatedness score
  • Is "Steve Jobs" more related to "Apple Inc." than "Bill Gates" is?
  • High inter-annotator agreement
• KORE (Hoffart et al., 2012)
  • 21 seed entities
  • Every seed entity has a list of 20 candidate entities with relatedness scores
  • 420 entity pairs in total
19. Results: KORE Dataset
Wikipedia-based Distributional Semantics for Entity Relatedness. In: AAAI-FSS-2014
Spearman rank correlation on the KORE dataset:
• Graph-based measures
  Path-DBpedia (Hulpus et al., 2015)        0.610
  WLM (Milne and Witten, 2008)              0.659
  PPR (Agirre et al., 2015)                 0.662
• Corpus-based measures
  Word2Vec (Mikolov et al., 2013)           0.181
  GloVe (Pennington et al., 2014)           0.194
  LSA (Landauer et al., 1998)               0.375
  KORE (Hoffart et al., 2012)               0.679
  ESA (Gabrilovich and Markovitch, 2007)    0.691
  DiSER                                     0.781
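The numbers above are Spearman rank correlations between a measure's scores and the gold ranking. Per seed entity this can be computed as in the following sketch; the five candidate scores are made-up toy values.

from scipy.stats import spearmanr

gold_scores   = [20, 17, 12, 8, 3]             # gold relatedness for one seed's candidates
system_scores = [0.81, 0.77, 0.40, 0.52, 0.05] # scores from the measure under evaluation
rho, _ = spearmanr(gold_scores, system_scores)
print(rho)   # 1.0 only if the measure ranks the candidates exactly like the gold standard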
20. DiSER Vector for non-Wikipedia Entities
22. Context-DiSER: Example
[Figure] Context-DiSER vector for "Savita Halappanavar", an entity without a Wikipedia article at the time: entities appearing in her textual context (Abortion, Abortion-rights movement, The Irish Times, United States pro-life movement, Vincent Browne, Michael D. Higgins, Irish abortion law, Death of Savita, Galway University Hospital, Miscarriage, Catholic Country, ...) are used to build her Context-DiSER representation.
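A minimal sketch of the Context-DiSER idea under simplifying assumptions (hypothetical 4-dimensional DiSER vectors and plain summation as the aggregation step): an entity with no Wikipedia article is represented by combining the DiSER vectors of the entities linked in its textual context, and can then be compared to any Wikipedia entity.

import numpy as np

diser_vec = {   # hypothetical precomputed DiSER vectors (dimensions = Wikipedia articles)
    "Abortion":                   np.array([0.9, 0.1, 0.0, 0.2]),
    "Galway University Hospital": np.array([0.1, 0.8, 0.3, 0.0]),
    "Miscarriage":                np.array([0.4, 0.5, 0.1, 0.0]),
    "Irish abortion law":         np.array([0.8, 0.2, 0.1, 0.1]),
}
def context_diser(context_entities):          # aggregate the context entities' vectors
    v = np.sum([diser_vec[e] for e in context_entities], axis=0)
    return v / np.linalg.norm(v)

savita = context_diser(["Abortion", "Galway University Hospital", "Miscarriage"])
candidate = diser_vec["Irish abortion law"] / np.linalg.norm(diser_vec["Irish abortion law"])
print(float(savita @ candidate))              # relatedness of the out-of-Wikipedia entity to a candidate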
23. Context-DiSER: Results on the KORE Dataset
Wikipedia-based Distributional Semantics for Entity Relatedness. In: AAAI-FSS-2014
Spearman rank correlation:
  KORE (state of the art)              0.679
  Context-ESA                          0.684
  Context-DiSER (manual linking)       0.769
  Context-DiSER (automatic linking)    0.719
24. Thesis Overview: next, Chapter V, Wikipedia-based Features for Entity Recommendation (WiFER)
26. Entity Recommendation: State of the Art
• Classical recommender systems
  • Focus on personalized recommendation
  • Require user-item preferences
• Entity recommendation in web search (Blanco et al., 2013)
  • Co-occurrence features: query logs, query sessions, Flickr tags, tweets
  • Graph-based features: shared connections in the Yahoo knowledge graph and other domain-specific knowledge bases
  • Entity and relation types in the knowledge graph
  • More than 100 features
  • Combines features using learning to rank
27. Wikipedia-based Features for Entity Recommendation (WiFER)
Leveraging Wikipedia Knowledge for Entity Recommendations. In: ISWC 2015
Features (combined using learning to rank):
• Prior probability of entity1
• Prior probability of entity2
• Joint probability
• Conditional probability
• Reverse conditional probability
• Cosine similarity
• Pointwise mutual information
• Distributional semantic model
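Most of these features reduce to co-occurrence statistics over a corpus. A minimal sketch with a hypothetical four-article corpus in which each article is reduced to the set of entities it links to; the thesis computes the same statistics over the full Wikipedia text and entity corpora.

from math import log

articles = [                                  # toy corpus: entity sets per Wikipedia article
    {"Steve Jobs", "Apple Inc.", "Pixar"},
    {"Steve Jobs", "Apple Inc.", "Steve Wozniak"},
    {"Bill Gates", "Microsoft"},
    {"Apple Inc.", "Steve Wozniak"},
]
N = len(articles)
def p(*entities):                             # fraction of articles containing all given entities
    return sum(all(e in a for e in entities) for a in articles) / N

e1, e2 = "Steve Jobs", "Apple Inc."
prior1, prior2 = p(e1), p(e2)                 # prior probabilities
joint = p(e1, e2)                             # joint probability
cond, rev_cond = joint / prior2, joint / prior1   # conditional and reverse conditional
pmi = log(joint / (prior1 * prior2))          # pointwise mutual information
print(prior1, prior2, joint, cond, rev_cond, pmi)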
28. Wikipedia-based Features for Entity Recommendation (WiFER)
The same feature set is computed over two corpora:
• Wikipedia text: prior probability of entity1, prior probability of entity2, joint probability, conditional probability, reverse conditional probability, cosine similarity, pointwise mutual information, distributional semantic model (ESA)
• Wikipedia entities: the same probability and co-occurrence features, with DiSER as the distributional semantic model
29. Combining Features
• Learning to rank
  • Gradient Boosted Decision Trees (GBDT) (Hang Li, 2011)
  • Builds the model in a stage-wise fashion
• Dataset: entity recommendation in web search
  • 4,797 web search queries (entities)
  • Every entity query has a list of entity candidates (47,623 entity pairs)
  • All candidates are tagged on a 5-label scale: Excellent, Prefer, Good, Fair, and Bad
Candidate distribution by entity type:
  Type        Total instances   Percentage
  Location    22,062            46.32
  People      21,626            45.41
  Movies      3,031             6.36
  TV Shows    280               0.58
  Album       563               1.18
  Total       47,623            100
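A minimal pointwise approximation of the GBDT ranking step with scikit-learn; the feature vectors and labels below are toy values, and the actual system uses Yahoo's GBDT learning-to-rank setup over the full feature set rather than this simplification.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# one row per (query entity, candidate entity) pair; labels Bad..Excellent mapped to 0..4
X = np.array([[0.90, 0.80, 3.1],
              [0.20, 0.10, 0.4],
              [0.60, 0.50, 1.2],
              [0.05, 0.02, 0.1]])
y = np.array([4, 1, 2, 0])
ranker = GradientBoostingRegressor(n_estimators=50, max_depth=2).fit(X, y)
scores = ranker.predict(X)                    # candidates of a query are ranked by these scores
print(np.argsort(-scores))                    # indices from most to least recommended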
30. Entity Recommendation: Results
Insights into Entity Recommendation in Web Search. In: IESD at ISWC, 2015
• Evaluation
  • Normalized discounted cumulative gain (NDCG@10)
  • 10-fold cross-validation
  Features                       All      Person   Location
  Spark (Blanco et al., 2013)    0.9276   0.9479   0.8882
  WiFER                          0.9173   0.9431   0.8795
  Spark+WiFER                    0.9325   0.9505   0.8987
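NDCG@10 can be computed per query from the graded labels of the ranked candidates, as in this sketch; the label list is a toy example and a common gain/discount formulation is assumed, which may differ in small details from the one used in the evaluation.

import numpy as np

def dcg(labels):
    labels = np.asarray(labels, dtype=float)
    return float(np.sum((2 ** labels - 1) / np.log2(np.arange(2, len(labels) + 2))))

def ndcg_at_k(labels_in_ranked_order, k=10):
    ideal = sorted(labels_in_ranked_order, reverse=True)
    denom = dcg(ideal[:k])
    return dcg(list(labels_in_ranked_order)[:k]) / denom if denom > 0 else 0.0

# graded labels (Bad=0 .. Excellent=4) in the order the system ranked the candidates
print(ndcg_at_k([4, 2, 3, 0, 1], k=10))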
31. Entity Recommendation: Feature Analysis in Spark+WiFER
Insights into Entity Recommendation in Web Search. In: IESD at ISWC, 2015
[Figure] Feature importance in the combined Spark+WiFER model; features shown include:
• Relation type
• Cosine similarity over Flickr tags
• Probability of target entity over the Wikipedia text corpus
• CF7 over Flickr tags
• DSM over the Wikipedia entities corpus (DiSER)
• Conditional user probability over query terms
• DSM over the Wikipedia text corpus (ESA)
• Probability of source entity over the Wikipedia entities corpus
• Probability of target entity over Flickr tags
• Probability of target entity over the Wikipedia entities corpus
32. Thesis Overview: next, Chapter VI, Non-Orthogonal Explicit Semantic Analysis (NESA)
34. Orthogonality in ESA (Chapter VI)
Improving ESA with Document Similarity. In: ECIR-2013
ESA assumes that related words share highly weighted concepts in their distributional vectors.
Top concepts for "soccer": History of Soccer in the United States, Soccer in the United States, United States Soccer Federation, North American Soccer League, United Soccer Leagues
Top concepts for "football": FIFA, Football, History of association football, Football in England, Association football
The two vectors share no dimensions, so ESA(football, soccer) = 0.0
35. Non-Orthogonal Explicit Semantic Analysis (NESA) (Chapter VI)
Improving ESA with Document Similarity. In: ECIR-2013
Top concepts for "soccer": History of Soccer in the United States, Soccer in the United States, United States Soccer Federation, North American Soccer League, United Soccer Leagues
Top concepts for "football": FIFA, Football, History of association football, Football in England, Association football
NESA(football, soccer) = (FIFA x Soccer in the United States + FIFA x United Soccer Leagues + ...) = 0.38
36. Non-Orthogonal Explicit Semantic Analysis (NESA)
• ESA: v1 and v2 are the n-dimensional vectors for words w1 and w2
  rel_ESA(w1, w2) = v1^T . v2
• NESA: correlation between vector dimensions
  rel_NESA(w1, w2) = v1^T . C . v2, where C is the n x n dimension-correlation matrix, C = E^T . E
• Dimension correlation methods
  • DiSER scores between the corresponding Wikipedia articles
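A minimal numpy sketch of the NESA computation; the 4-dimensional ESA vectors and the random article representations E are toy assumptions, whereas in the thesis the dimension correlations come from DiSER or similar article-level relatedness scores.

import numpy as np

rng = np.random.default_rng(0)
n = 4                                   # explicit dimensions (Wikipedia articles)
E = rng.normal(size=(16, n))            # hypothetical representation of each article (one column each)
E /= np.linalg.norm(E, axis=0)          # unit columns, so C holds cosine similarities
C = E.T @ E                             # n x n dimension-correlation matrix, C = E^T E

v1 = np.array([0.0, 0.9, 0.4, 0.0])     # ESA vector of word 1 (toy weights)
v2 = np.array([0.7, 0.0, 0.0, 0.3])     # ESA vector of word 2: shares no dimensions with v1
print(float(v1 @ v2))                   # ESA relatedness: 0.0 (orthogonal dimensions)
print(float(v1 @ C @ v2))               # NESA relatedness: non-zero via correlated dimensions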
37. Word Relatedness Datasets
• WN353: 353 word pairs annotated by 13-15 experts on a scale of 0-10
• RG65: 65 word pairs annotated by 51 experts on a scale of 0-4
• MC30: 30 word pairs annotated by 38 experts on a scale of 0-4
• MT287: 287 word pairs annotated by 10-12 annotators (Amazon Mechanical Turk) on a scale of 0-1
38. NESA: Results
Non-Orthogonal Explicit Semantic Analysis. In: *SEM-2015 (Chapter VI)
Spearman rank correlation with word similarity gold standard datasets:
              WN353   MC30    RG65    MT287
  LSA         0.579   0.667   0.616   0.555
  LSA (Wiki)  0.538   0.744   0.697   0.353
  Word2Vec    0.663   0.824   0.751   0.560
  ESA         0.660   0.765   0.826   0.507
  NESA        0.696   0.784   0.839   0.572
39. NESA: Results
Non-Orthogonal Explicit Semantic Analysis. In: *SEM-2015 (Chapter VI)
• Word similarity vs relatedness (Agirre et al., 2009)
  • WN353Rel: 252 related word pairs from WN353
  • WN353Sim: 203 similar word pairs from WN353
Spearman rank correlation on the similarity vs relatedness splits:
              WN353Rel   WN353Sim
  LSA         0.521      0.662
  LSA (Wiki)  0.506      0.559
  Word2Vec    0.601      0.741
  ESA         0.643      0.663
  NESA        0.663      0.719
40. Outline (revisited): next, Application and Industry Use Cases
42. EnRG SPARQL Endpoint
[Screenshot] EnRG SPARQL endpoint, National University of Ireland, Galway
43. Industrial Use Cases
• Medical entity linking for question answering and relationship explanation in a knowledge graph
• Entity recommendation in web search
• Company name disambiguation for social profiling
44. Conclusion
• Entity Relatedness
  • Distributional Semantics for Entity Relatedness (DiSER)
  • Outperformed state-of-the-art entity relatedness measures
• Entity Recommendation
  • Wikipedia-based Features for Entity Recommendation (WiFER)
  • Effective features for entity recommendation in web search
• Text Relatedness
  • Non-Orthogonal Explicit Semantic Analysis (NESA)
  • Outperformed existing word relatedness measures
• Entity Relatedness Graph (EnRG)
  • Contains all Wikipedia entities and their pre-computed relatedness scores
  • Contains distributional vectors for all Wikipedia entities
45. Future Research Directions
• Relationship explanation for recommended entities
  • Best path in the knowledge graph
  • Best natural-language description
• Knowledge discovery
  • Analogy querying over the knowledge graph, e.g., Google to Motorola => Microsoft to ?
  • Example-based querying, e.g., Google to Motorola => ? to ?
Don’t call them concept try to be specific like technologies for SemWeb
Change box in horizontal boxes not vertical onces
Entity describe definitions in different communities and the definition we will carry out in our presentation
Relatedness describe it with the notion of similarity vs relatedness by illustrating wordnet relations (taxonomic vs others), further describe a simple van diagram
We do not distinguish between entity and concept like football player
Entity describe definitions in different communities and the definition we will carry out in our presentation
Relatedness describe it with the notion of similarity vs relatedness by illustrating wordnet relations (taxonomic vs others), further describe a simple van diagram
Relatedness reflects the degree of associativity, connectivity
Relatedness score: University => Student, building
Similarity score => Student, building
Relatedness score: University => Student, bio lab
Similarity score => Student, bio lab
Change box to chapter names
Entity Relatedness => DiSER
Context-VSM and Context-ESA
- Vector similarity between corresponding Wikipedia articles
- ESA score between corresponding Wikipedia article
Change to diser explanation
Change to diser explanation
Backup slide on Vector composition
Lucene based “Brad Pitt”
Add wiki markups to show one sense per document
Highlight the relevant articles in both vectors
Merge next slide with one
Explain entity disambiguation for context-diser
Wikipedia text and entity tagged
One thing to notice:
We only get the articles that contain the given entity as wikipedia links not only world
So, It performs better than text DSM
Describe GBDT
Change table
Change table
Change x to pairwise sim symbol
Change to equation
Consistency in subscript and superscript
Backup: Word Relatedness Dataset Details
• MC30: 30 noun pairs with relatedness scores on a scale of 0-4, prepared by Miller and Charles (1991); scores were provided by 38 human experts.
• WN353: 353 word pairs annotated by 13-15 human experts on a scale of 0-10 (10 = highly related, 0 = unrelated); contains generic words as well as named entities.
• WN353Sim and WN353Rel: a refinement of WN353 by Agirre et al. (2009) into similar and related word pairs. Two words are similar if they are connected through a taxonomic relation such as synonymy or hyponymy; two words are related if they are connected through relations such as meronymy or holonymy. WN353Rel and WN353Sim contain 252 and 203 word pairs respectively.
• RG65: 65 non-technical word pairs annotated by 51 human experts.
• MT771: 771 word pairs with relatedness scores; the words are generic and drawn from many domains.
• MT287: 287 word pairs with relatedness scores, prepared using Amazon Mechanical Turk (MT).