“Exploiting Wikipedia for Entity Name Disambiguation in Tweets”
Muhammad Atif Qureshi
Colm O'Riordan
Gabriella Pasi
06/16/14 2
Contents
● Introduction
● Related Work
● Methodology
● Evaluation
● Conclusion
Motivation
● Social media users voice their opinions about various
entities/brands (e.g., musicians, movies, companies)
● These opinions serve as implicit feedback on an entity/brand
● This has recently given birth to a new area within the
marketing domain known as “online reputation
management”
Problem Statement
● Given a set of tweets collected by querying for an entity (brand) name, the task is to determine which tweets are related to the entity and which are not
● Example: decide whether each tweet is related to Apple Inc.
– “Apple tastes better than blackberry”
– “Apple phones are better than blackberry”
Wikipedia Graph Structure
[Figure: Wikipedia graph with category nodes (C1-C7, C9, C10) and article nodes (A1-A4). Edge types: category edge, article belonging to category, article link.]
Related Work
● Entity linking: linking an entity mention to its correct sense
– Ferragina and Scaiella 2010 and Meij et al. 2012 have proposed strategies for tweets
● These use the hyperlink structure of Wikipedia and the anchor texts of the links to Wikipedia pages
● Disambiguation is performed by applying a voting function among all senses associated with the detected anchors
– Meij et al. 2012 additionally employ supervised machine learning techniques for further improvement
Methodology
● Chunking Strategy
● Entity Phrases & Categories
● Features Based on Wikipedia Articles' Hyperlinks
● Relatedness Score Based on Wikipedia Category-Article Structure
Chunking Strategy
● Original tweet: “I prefer Samsung over HTC, Apple, Nokia because it is economical and good”
● Normalized tweet: “i prefer samsung over htc apple nokia because it is economical and good”
● Phrase chunks with boundaries: prefer | samsung | htc | apple | nokia | economical
● Stopwords are removed, and the longest phrase that matches a Wikipedia article title is taken as a chunk
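The chunking steps can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: `chunk_tweet`, the toy `TITLES` set, the `STOPWORDS` list, and the `max_len` cap on phrase length are all hypothetical stand-ins for a real Wikipedia title index and stopword list.

```python
import re

# Hypothetical stand-ins: a real system would load a full stopword list and
# the set of lowercased Wikipedia article titles.
STOPWORDS = {"i", "over", "because", "it", "is", "and"}
TITLES = {"prefer", "samsung", "htc", "apple", "nokia", "economical"}


def chunk_tweet(text, titles=TITLES, stopwords=STOPWORDS, max_len=4):
    """Greedy longest-match chunking of a tweet against article titles."""
    # Lowercase and split into word tokens (punctuation is dropped).
    tokens = re.findall(r"\w+", text.lower())
    chunks = []
    i = 0
    while i < len(tokens):
        step = 1  # advance one token when nothing matches at position i
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            words = tokens[i:i + n]
            phrase = " ".join(words)
            # Skip candidates that consist only of stopwords.
            if phrase in titles and not all(w in stopwords for w in words):
                chunks.append(phrase)
                step = n
                break
        i += step
    return chunks


tweet = "I prefer Samsung over HTC, Apple, Nokia because it is economical and good"
print(chunk_tweet(tweet))
```

With these toy inputs the call yields the six chunks listed for the example tweet; "good" is dropped only because it is absent from the toy title set.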
Entity Phrases & Categories
[Figure: flow diagram with edge labels “Has a”, “Has”, and “Mentions inside”; its recoverable content is listed below.]
● Entity E1 has a Wikipedia article AE1
● AE1 has a list of Wikipedia categories CL_E1
● SCL_E1: sub-categories of CL_E1, up to a depth of 2
● Entity categories (RC): CL_E1 and SCL_E1, a subset of the Wikipedia categories WC (i.e., RC ⊂ WC)
● Entity phrases of E1 (Articles_RC): Wikipedia articles in CL_E1 and Wikipedia articles in SCL_E1
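Collecting the sub-categories up to depth 2 amounts to a bounded breadth-first traversal of the category graph. The sketch below assumes a hypothetical in-memory mapping `subcategory_map` from a category to its direct sub-categories, and assumes that the entity article's own categories sit at depth 0; neither detail is stated on the slides.

```python
from collections import deque


def collect_entity_categories(article_categories, subcategory_map, max_depth=2):
    """Return {category: depth} for the article's categories and their
    sub-categories down to max_depth (bounded breadth-first search)."""
    # Assumption: the entity article's own categories are at depth 0.
    depth = {cat: 0 for cat in article_categories}
    queue = deque(article_categories)
    while queue:
        cat = queue.popleft()
        if depth[cat] >= max_depth:
            continue  # do not expand below the depth limit
        for sub in subcategory_map.get(cat, ()):
            if sub not in depth:  # keep the shallowest depth seen first
                depth[sub] = depth[cat] + 1
                queue.append(sub)
    return depth


# Toy category graph (hypothetical names).
toy_map = {"A": ["B"], "B": ["C"], "C": ["D"]}
print(collect_entity_categories(["A"], toy_map))
```

Returning the depth alongside each category is a design choice that makes the result directly usable by a depth-weighted score.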
Features Based on Wikipedia Articles' Hyperlinks
● Entity: Apple Inc.
● Chunked tweet: entity phrase “apple”; context phrases “doctor” and “fruit”

                       Context phrase senses
                       doctor                   fruit
Entity phrase sense    phd    band   medical    album  plant    Avg. max. sense score
apple (fruit)          80     45     230        6      532      381
apple (film)           10     50     0          9      0        29.5
apple (inc.)           83     20     10         5      0        44

● Feature values are generated using inlinks, outlinks, and inlinks+outlinks
● Sense apple (inc.) is related to the entity, while the others are not
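The "Avg. Max. Sense Score" column is consistent with taking, for each context phrase, the maximum value over that phrase's senses and then averaging over the context phrases. The sketch below reproduces the three table rows under that reading; the function name and the nested-dictionary layout are illustrative assumptions.

```python
def avg_max_sense_score(counts_by_context):
    """Average, over context phrases, of the maximum value among the
    senses of each context phrase."""
    maxima = [max(senses.values()) for senses in counts_by_context.values()]
    return sum(maxima) / len(maxima)


# Values copied from the table: one entry per entity phrase sense.
table = {
    "apple (fruit)": {"doctor": {"phd": 80, "band": 45, "medical": 230},
                      "fruit": {"album": 6, "plant": 532}},
    "apple (film)": {"doctor": {"phd": 10, "band": 50, "medical": 0},
                     "fruit": {"album": 9, "plant": 0}},
    "apple (inc.)": {"doctor": {"phd": 83, "band": 20, "medical": 10},
                     "fruit": {"album": 5, "plant": 0}},
}

for sense, counts in table.items():
    print(sense, avg_max_sense_score(counts))
```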
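Relatedness Score Based on Wikipedia Category-Article Structure
\[ \mathrm{DepthSignificance}(p) = \sum_{cat \,\in\, RC \cap p_{cat}} \frac{1}{depth_{cat} + 1} \]
\[ \mathrm{CatSignificance}(p) = \frac{|RC \cap p_{cat}|}{|WC \cap p_{cat}|} \times \log\left(|RC \cap p_{cat}| + 1\right) \]
\[ \mathrm{PhraseSignificance}(p) = \log\left(\mathrm{wordlen}(p) + 1\right) \times p_{frequency} \]
\[ \mathrm{RelatednessScore} = \sum_{p \,\in\, MatchedPhrases} \mathrm{DepthSignificance}(p) \times \mathrm{CatSignificance}(p) \times \mathrm{PhraseSignificance}(p) \]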
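A minimal sketch of the relatedness score follows. Several details are assumptions not fixed by the slides: the natural logarithm is used, `wordlen` is read as the number of words in the phrase, `frequency` as the phrase's count in the tweet, and phrases with no category in RC contribute zero. The dictionary field names are hypothetical.

```python
import math


def relatedness_score(matched_phrases, rc_depth, wc):
    """Sum of DepthSignificance * CatSignificance * PhraseSignificance.

    matched_phrases: list of dicts with keys "categories" (set of category
        names of the phrase), "wordlen" and "frequency" (hypothetical layout).
    rc_depth: {category: depth} for the entity categories RC.
    wc: set of Wikipedia categories WC (RC is a subset of WC).
    """
    total = 0.0
    for p in matched_phrases:
        rc_hits = p["categories"] & set(rc_depth)
        wc_hits = p["categories"] & wc
        if not rc_hits:
            continue  # assumption: no overlap with RC means no contribution
        depth_sig = sum(1.0 / (rc_depth[cat] + 1) for cat in rc_hits)
        cat_sig = len(rc_hits) / len(wc_hits) * math.log(len(rc_hits) + 1)
        phrase_sig = math.log(p["wordlen"] + 1) * p["frequency"]
        total += depth_sig * cat_sig * phrase_sig
    return total


# Toy example with hypothetical category names.
rc_depth = {"c1": 0, "c2": 1}
wc = {"c1", "c2", "x", "y"}
phrases = [{"categories": {"c1", "c2", "x"}, "wordlen": 2, "frequency": 1}]
print(relatedness_score(phrases, rc_depth, wc))
```

In the toy example the three factors are 1.5, (2/3)·ln 3 and ln 3, so the score equals (ln 3)².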
Dataset
● Multilingual tweets of 61 entities (25%
Spanish, 75% English)
– Training ~749 tweets for each entity
– Testing ~1481 tweets for each entity
                         Training                        Testing
Domain      Entities     Non    Rev    Orig   Trans      Non    Rev    Orig   Trans
Music       20           1461   14353  12518  3296       1998   28137  23442  6693
University  10           3548   3412   6569   391        6760   7387   13060  1087
Banking     11           2021   5753   5327   2447       4335   11635  10918  5052
Automotive  20           3767   11356  12585  2538       6851   23253  24690  5414
Total       61           10797  34874  36999  8672       19944  70412  72110  18246
Measure
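● Reliability is the product of the precision values of both classes (positive and negative, i.e., tweets related and unrelated to the entity)
● Sensitivity is the product of the recall values of both classes
\[ \mathrm{Reliability} = \frac{TP}{TP + FP} \times \frac{TN}{TN + FN} \]
\[ \mathrm{Sensitivity} = \frac{TP}{TP + FN} \times \frac{TN}{TN + FP} \]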
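The two measures can be computed directly from confusion-matrix counts, as in the sketch below. Two points are assumptions rather than content of the slides: an undefined ratio (zero denominator) is treated as 0.0, and F(R,S) is taken to be the harmonic mean of reliability and sensitivity, which the slides report but do not define.

```python
def _ratio(numerator, denominator):
    # Assumption: an undefined ratio (zero denominator) counts as 0.0.
    return numerator / denominator if denominator else 0.0


def reliability(tp, fp, tn, fn):
    """Product of the precision of the positive and the negative class."""
    return _ratio(tp, tp + fp) * _ratio(tn, tn + fn)


def sensitivity(tp, fp, tn, fn):
    """Product of the recall of the positive and the negative class."""
    return _ratio(tp, tp + fn) * _ratio(tn, tn + fp)


def f_measure(r, s):
    # Assumption: F(R,S) is the harmonic mean of reliability and sensitivity.
    return _ratio(2 * r * s, r + s)


# Toy counts (not taken from the evaluation).
r = reliability(tp=8, fp=2, tn=6, fn=4)
s = sensitivity(tp=8, fp=2, tn=6, fn=4)
print(r, s, f_measure(r, s))
```

Because both measures are products over the two classes, a classifier that labels every tweet with a single class scores zero on both.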
Settings
● Classifier: Random Forest
Setting    Features based on     Relatedness score based on    Domain-level   Entity-level
           Wikipedia articles'   Wikipedia category-article    training       training
           hyperlinks            structure
hrdomain   x                     x                             x
hrentity   x                     x                                            x
rdomain                          x                             x
rentity                          x                                            x
Results
Performance Comparison with Other Systems
Team          Reliability  Sensitivity  F(R,S)
POPSTAR       0.73         0.45         0.49
OUR APPROACH  0.67         0.42         0.45
SZTE NLP      0.60         0.44         0.44
LIA           0.66         0.36         0.38
BASELINE      0.49         0.32         0.33
UvA UNED      0.68         0.22         0.21

Evaluation Results on Test Set by Domain
Domain      Setting   Reliability  Sensitivity  F(R,S)
Automotive  hrdomain  0.54         0.47         0.47
Banking     hrentity  0.75         0.58         0.49
University  hrdomain  0.71         0.44         0.49
Music       rentity   0.83         0.34         0.39
Conclusion
● The experimental evaluations establish Wikipedia’s
strength as a significant encyclopaedic resource for
the task of entity name disambiguation in tweets.
● The relatedness score defined using the Wikipedia category-article structure introduces a powerful semantic notion of linking n-grams in a tweet with the information relevant to an entity.
● As future work, we aim to combine our Wikipedia-based features with text-based techniques to further improve performance.
References
● E. Amigo, J. Carrillo de Albornoz, I. Chugur, A. Corujo, J. Gonzalo, T. Martin, E. Meij, M. de Rijke, and D. Spina. Overview of RepLab 2013: Evaluating on-line reputation monitoring systems. In CLEF 2013 Labs and Workshop Notebook Papers, Springer LNCS, 2013.
● P. Ferragina and U. Scaiella. Tagme: on-the-fly annotation of short text fragments (by Wikipedia entities). In CIKM ’10, pages 1625–1628, New York, NY, USA, 2010. ACM.
● E. Meij, W. Weerkamp, and M. de Rijke. Adding semantics to microblog posts. In WSDM ’12, pages 563–572, New York, NY, USA, 2012. ACM.
● M.-H. Peetz, D. Spina, J. Gonzalo, and M. de Rijke. Towards an active learning system for company name disambiguation in microblog streams. In CLEF (Online Working Notes/Labs/Workshop), 2013.
Questions?
