4. 06/16/14 4
Motivation
● Social media users voice their opinions about various
entities/brands (e.g., musicians, movies, companies)
● These opinions act as implicit feedback about the entity/brand
● This has recently given birth to a new area within the
marketing domain known as “online reputation
management”
5. 06/16/14 5
Problem Statement
● Given a set of tweets retrieved by querying for an
entity (brand) name, the task is to determine which
tweets are related to the entity and which are not
● Example: decide whether a tweet is related to Apple Inc.
– “Apple tastes better than blackberry” (not related)
– “Apple phones are better than blackberry” (related)
6. 06/16/14 6
Wikipedia Graph Structure
[Figure: Wikipedia graph with category nodes (C1 to C7, C9, C10) and article nodes (A1 to A4). Legend: Category; Article; Category Edge; Article Belonging to Category; Article Link]
7. 06/16/14 7
Related Work
● Entity Linking: linking an entity mention to its correct sense
– Ferragina and Scaiella 2010 and Meij et al. 2012 have
proposed strategies for tweets
● They use the hyperlink structure of Wikipedia and the anchor texts of the
links to Wikipedia pages
● Disambiguation is performed by applying a voting
function among all senses associated with the detected anchors
– Meij et al. 2012 employ supervised machine learning
techniques for further improvement
8. 06/16/14 8
Methodology
● Chunking Strategy
● Entity Phrases & Categories
● Features Based on Wikipedia Articles'
Hyperlinks
● Relatedness Score Based on Wikipedia
Category-Article Structure
9. 06/16/14 9
Chunking Strategy
Original tweet: I prefer Samsung over HTC, Apple, Nokia because it is economical and good
Normalized tweet: i prefer samsung over htc apple nokia because it is economical and good
Phrase chunks with boundaries: prefer | samsung | htc | apple | nokia | economical
(Stopwords removed; longest phrase matched against Wikipedia article titles)
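The chunking steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the stopword list, the set of article titles, the function name `chunk_tweet`, and the `max_len` limit are all hypothetical stand-ins, and the toy data is chosen so that the slide's example tweet yields the chunks shown above.

```python
import re

# Hypothetical toy data: a real system would use a full stopword list
# and the complete set of Wikipedia article titles.
STOPWORDS = {"i", "over", "because", "it", "is", "and"}
ARTICLE_TITLES = {
    "prefer", "samsung", "htc", "apple", "apple inc", "nokia", "economical",
}


def chunk_tweet(tweet, titles=ARTICLE_TITLES, stopwords=STOPWORDS, max_len=4):
    """Split a tweet into phrase chunks by greedy longest match on titles."""
    # Lowercase and replace punctuation with spaces before tokenizing.
    tokens = re.sub(r"[^\w\s]", " ", tweet.lower()).split()
    chunks = []
    i = 0
    while i < len(tokens):
        # A stopword never starts a chunk.
        if tokens[i] in stopwords:
            i += 1
            continue
        # Try the longest n-gram first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in titles:
                chunks.append(candidate)
                i += n
                break
        else:
            # No article title starts at this token; skip it.
            i += 1
    return chunks


tweet = "I prefer Samsung over HTC, Apple, Nokia because it is economical and good"
print(chunk_tweet(tweet))
```

Trying the longest n-gram first means a multi-word title such as "apple inc" is preferred over the single-word title "apple" whenever both match.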
10. 06/16/14 10
Entity Phrases & Categories
● Entity E1 has a Wikipedia article AE1
● AE1 has a list of Wikipedia categories CL_E1
● Sub-categories SCL_E1 of CL_E1 are collected up to a depth of 2
● CL_E1 and SCL_E1 together form the entity categories (RC), a subset of the categories WC (i.e., RC ⊂ WC)
● The Wikipedia articles in CL_E1 and in SCL_E1 form the list of entity phrases of E1 (Articles_RC)
(Diagram edge labels: "Has a", "Has", "Mentions inside")
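One way to picture the depth-limited category expansion on this slide is a breadth-first walk over a category graph. The sketch below is illustrative only: the toy graph, the mapping names `SUBCATEGORIES` and `ARTICLES_IN`, and both function names are assumptions, and a real system would read these relations from a Wikipedia dump or API.

```python
from collections import deque

# Hypothetical toy data: category -> direct sub-categories.
SUBCATEGORIES = {
    "C1": ["C2", "C3"],
    "C2": ["C4"],
    "C4": ["C5"],
}
# Hypothetical toy data: category -> articles belonging to it.
ARTICLES_IN = {"C1": ["A1"], "C2": ["A2"], "C4": ["A3"], "C5": ["A4"]}


def entity_categories(cl_e1, subcats, max_depth=2):
    """Return CL_E1 plus all sub-categories reachable within max_depth (RC)."""
    seen = set(cl_e1)
    queue = deque((cat, 0) for cat in cl_e1)
    while queue:
        cat, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand below the depth limit
        for sub in subcats.get(cat, []):
            if sub not in seen:
                seen.add(sub)
                queue.append((sub, depth + 1))
    return seen


def entity_phrases(rc, articles_in):
    """Return the articles found in any category of RC."""
    return {article for cat in rc for article in articles_in.get(cat, [])}


rc = entity_categories(["C1"], SUBCATEGORIES)
print(sorted(rc))                               # C5 lies at depth 3 and is excluded
print(sorted(entity_phrases(rc, ARTICLES_IN)))  # articles of C1, C2, C4
```

The `seen` set also guards against cycles, which can occur in Wikipedia's category graph.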
11. 06/16/14 11
Features Based on Wikipedia Articles' Hyperlinks
Chunked tweet with entity phrase "apple" and context phrases "doctor" and "fruit"
Entity Phrase    Context Phrase Senses                    Avg. Max.
Senses           doctor                 fruit             Sense Score
                 phd   band   medical   album   plant
apple (fruit)    80    45     230       6       532       381
apple (film)     10    50     0         9       0         29.5
apple (inc.)     83    20     10        5       0         44
Feature values are generated using inlinks, outlinks, and inlinks+outlinks
For entity Apple Inc., the sense apple (inc.) is related to the entity while the others are not
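The "Avg. Max. Sense Score" values in the example are consistent with taking, for each context phrase, the maximum score over its senses and then averaging those maxima. That reading is inferred from the numbers on the slide rather than stated there, so the sketch below is an assumption; the function name `avg_max_sense_score` is hypothetical, and the pairwise scores are simply copied from the example rather than derived from inlinks or outlinks.

```python
# Pairwise scores copied from the slide's example:
# entity phrase sense -> context phrase -> context phrase sense -> score.
SCORES = {
    "apple (fruit)": {
        "doctor": {"phd": 80, "band": 45, "medical": 230},
        "fruit": {"album": 6, "plant": 532},
    },
    "apple (film)": {
        "doctor": {"phd": 10, "band": 50, "medical": 0},
        "fruit": {"album": 9, "plant": 0},
    },
    "apple (inc.)": {
        "doctor": {"phd": 83, "band": 20, "medical": 10},
        "fruit": {"album": 5, "plant": 0},
    },
}


def avg_max_sense_score(context_scores):
    """Average, over context phrases, of the best score among each phrase's senses."""
    maxima = [max(senses.values()) for senses in context_scores.values()]
    return sum(maxima) / len(maxima)


for sense, context_scores in SCORES.items():
    print(sense, avg_max_sense_score(context_scores))
# apple (fruit) 381.0
# apple (film) 29.5
# apple (inc.) 44.0
```

For apple (fruit), the maxima are 230 (doctor) and 532 (fruit), giving (230 + 532) / 2 = 381, which matches the table.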
13. 06/16/14 13
Dataset
● Multilingual tweets of 61 entities (25%
Spanish, 75% English)
– Training ~749 tweets for each entity
– Testing ~1481 tweets for each entity
Domains      No. of     Training                          Testing
             Entities   Non     Rev     Orig    Trans     Non     Rev     Orig    Trans
Music        20         1461    14353   12518   3296      1998    28137   23442   6693
University   10         3548    3412    6569    391       6760    7387    13060   1087
Banking      11         2021    5753    5327    2447      4335    11635   10918   5052
Automotive   20         3767    11356   12585   2538      6851    23253   24690   5414
Total        61         10797   34874   36999   8672      19944   70412   72110   18246
14. 06/16/14 14
Measure
● Reliability is the product of precision in both
classes (i.e., true positives and true negatives)
● Sensitivity is the product of recall of both
classes
$$\mathrm{Reliability} = \frac{TP}{TP + FP} \times \frac{TN}{TN + FN}$$
$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \times \frac{TN}{TN + FP}$$
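The two measures can be computed from confusion-matrix counts as in the short sketch below. The function names and the zero-denominator handling are illustrative choices, not something stated on the slides. The `f_measure` helper assumes that the F(R,S) column reported in the results is the harmonic mean of reliability and sensitivity.

```python
def _ratio(numerator, denominator):
    """Safe division: return 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def reliability(tp, fp, tn, fn):
    """Product of the precision of the positive and negative classes."""
    return _ratio(tp, tp + fp) * _ratio(tn, tn + fn)


def sensitivity(tp, fp, tn, fn):
    """Product of the recall of the positive and negative classes."""
    return _ratio(tp, tp + fn) * _ratio(tn, tn + fp)


def f_measure(r, s):
    """Harmonic mean of reliability and sensitivity (assumed form of F(R,S))."""
    return _ratio(2 * r * s, r + s)


# Made-up counts, purely for illustration.
r = reliability(tp=8, fp=2, tn=6, fn=4)   # 0.8 * 0.6
s = sensitivity(tp=8, fp=2, tn=6, fn=4)   # (8/12) * (6/8)
print(round(r, 4), round(s, 4), round(f_measure(r, s), 4))
```

Because each measure is a product over both classes, a system that labels every tweet as related (TN = 0) scores zero on both, however high its accuracy on the majority class.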
15. 06/16/14 15
Settings
● Classifier: Random Forest
Settings   Features Based on     Relatedness Score Based   Domain Level   Entity Level
           Wikipedia Articles'   on Wikipedia Category-    Training       Training
           Hyperlinks            Article Structure
hrdomain   x                     x                         x
hrentity   x                     x                                        x
rdomain                          x                         x
rentity                          x                                        x
16. 06/16/14 16
Results
Performance Comparison with Other Systems
Team           Reliability   Sensitivity   F(R,S)
POPSTAR        0.73          0.45          0.49
OUR APPROACH   0.67          0.42          0.45
SZTE NLP       0.60          0.44          0.44
LIA            0.66          0.36          0.38
BASELINE       0.49          0.32          0.33
UvA UNED       0.68          0.22          0.21
Evaluation Results on Test Set by Domain
Domain       Setting    Reliability   Sensitivity   F(R,S)
Automotive   hrdomain   0.54          0.47          0.47
Banking      hrentity   0.75          0.58          0.49
University   hrdomain   0.71          0.44          0.49
Music        rentity    0.83          0.34          0.39
17. 06/16/14 17
Conclusion
● The experimental evaluations establish Wikipedia’s
strength as a significant encyclopaedic resource for
the task of entity name disambiguation in tweets.
● The relatedness score defined using Wikipedia
category-article structure introduces a powerful
semantic notion of linking n-grams in a tweet with
the information relevant to an entity
● As future work, we aim to combine our Wikipedia
based features with text based techniques to further
improve the performance
18. 06/16/14 18
References
● E. Amigo, J. Carrillo de Albornoz, I. Chugur, A. Corujo, J. Gonzalo, T. Martin, E. Meij, M. de
Rijke, and D. Spina. Overview of RepLab 2013: Evaluating on-line reputation monitoring
systems. In CLEF 2013 Labs and Workshop Notebook Papers, Springer LNCS, 2013.
● P. Ferragina and U. Scaiella. TAGME: on-the-fly annotation of short text fragments (by
Wikipedia entities). In CIKM ’10, pages 1625–1628, New York, NY, USA, 2010. ACM.
● E. Meij, W. Weerkamp, and M. de Rijke. Adding semantics to microblog posts. In WSDM ’12,
pages 563–572, New York, NY, USA, 2012. ACM.
● M.-H. Peetz, D. Spina, J. Gonzalo, and M. de Rijke. Towards an active learning system for
company name disambiguation in microblog streams. In CLEF (Online Working
Notes/Labs/Workshop), 2013.