Entity Linking
10th Russian Summer School in Information Retrieval (RuSSIR 2016) | Saratov, Russia, 2016
Krisztian Balog
University of Stavanger

@krisztianbalog
From Named Entity
Recognition to Entity Linking
Named entity recognition (NER)
- Also known as entity identification, entity extraction, and
entity chunking

- Task: identifying named entities in text and labeling them with
one of the possible entity types

- Person (PER), organization (ORG), location (LOC), miscellaneous
(MISC). Sometimes also temporal expressions (TIMEX) and certain
types of numerical expressions (NUMEX)
<LOC>Silicon Valley</LOC> venture capitalist <PER>Michael
Moritz</PER> said that today's billion-dollar "unicorn" startups
can learn from <ORG>Apple</ORG> founder <PER>Steve Jobs</PER>
Named entity disambiguation
- Also called named entity normalization and named entity
resolution

- Task: assign ambiguous entity names to canonical entities
from some catalog

- It is usually assumed that entities have already been recognized in the
input text (i.e., it has been processed by a NER system)
Wikification
- Named entity disambiguation using Wikipedia as the catalog
of entities

- Also annotating concepts, not only entities
Entity linking
- Task: recognizing entity mentions in text and linking them to
the corresponding entries in a knowledge base (KB)

- Limited to recognizing entities for which a target entry exists in the
reference KB; each KB entry is a candidate
- It is assumed that the document provides sufficient context for
disambiguating entities
- Knowledge base (working definition):

- A catalog of entities, each with one or more names (surface forms),
links to other entities, and, optionally, a textual description
- Wikipedia, DBpedia, Freebase, YAGO, etc.
Overview of entity annotation tasks
Task | Recognition | Assignment
Named entity recognition | entities | entity type
Named entity disambiguation | entities | entity ID / NIL
Wikification | entities and concepts | entity ID / NIL
Entity linking | entities | entity ID
Entity linking in action
Anatomy of an entity linking system
document → Mention detection → Candidate selection → Disambiguation → entity annotations
- Mention detection: identification of text snippets that can potentially be linked to entities
- Candidate selection: generating a set of candidate entities for each mention
- Disambiguation: selecting a single entity (or none) for each mention, based on the context
Mention detection
- Goal: Detect all “linkable” phrases

- Challenges:

- Recall oriented
- Do not miss any entity that should be linked
- Find entity name variants
- E.g., “jlo” is a name variant of [Jennifer Lopez]
- Filter out inappropriate ones
- E.g., “new york” matches >2k different entities
Common approach
1. Build a dictionary of entity surface forms

- Entities with all their name variants
2. Check all document n-grams against the dictionary

- The value of n is typically set between 6 and 8
3. Filter out undesired entities

- Can be done here or later in the pipeline
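As a rough illustration of steps 1-2, here is a minimal Python sketch of the dictionary lookup; the `surface_forms` dictionary, the whitespace tokenizer, and the toy entries are illustrative rather than taken from the slides.

```python
# Minimal sketch of dictionary-based mention detection (illustrative only).
# `surface_forms` maps lowercased surface forms to sets of candidate entity IDs.
from typing import Dict, List, Set, Tuple

def detect_mentions(text: str,
                    surface_forms: Dict[str, Set[str]],
                    max_n: int = 8) -> List[Tuple[str, Set[str]]]:
    """Check all document n-grams (n <= max_n) against the surface form dictionary."""
    tokens = text.split()  # a real system would use a proper tokenizer
    mentions = []
    for i in range(len(tokens)):
        for n in range(1, max_n + 1):
            if i + n > len(tokens):
                break
            ngram = " ".join(tokens[i:i + n])
            candidates = surface_forms.get(ngram.lower())
            if candidates:
                mentions.append((ngram, candidates))
    return mentions

# Toy usage
surface_forms = {
    "times square": {"Times_Square", "Times_Square_(Hong_Kong)"},
    "empire state building": {"Empire_State_Building"},
}
print(detect_mentions("Home to the Empire State Building , Times Square", surface_forms))
```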
Example
Home to the Empire State Building, Times Square, Statue of Liberty and other iconic sites,
New York City is a fast-paced, globally influential center of art, culture, fashion and finance.
Surface form (s) | Entities (E_s)
… | …
Times Square | Times_Square, Times_Square_(Hong_Kong), Times_Square_(IRT_42nd_Street_Shuttle), …
Empire State Building | Empire_State_Building
Empire State | Empire_State_(band), Empire_State_Building, Empire_State_Film_Festival
Empire | British_Empire, Empire_(magazine), First_French_Empire, Galactic_Empire_(Star_Wars), Holy_Roman_Empire, Roman_Empire, …
… | …
Surface form dictionary construction from Wikipedia
- Page title: the canonical (most common) name of the entity
- Redirect pages: alternative names that are frequently used to refer to an entity
- Disambiguation pages: lists of entities that share the same name
- Anchor texts of links pointing to the entity's Wikipedia page
- Bold texts from the first paragraph, which generally denote other name variants of the entity
Surface form dictionary construction from other sources
- Anchor texts from external web pages pointing to Wikipedia
articles

- Problem of synonym discovery

- Expanding acronyms
- Leveraging search results or query-click logs from a web search engine
- ...
Filtering mentions
- Filter out mentions that are unlikely to be linked to any entity
- Keyphraseness:
  P(keyphrase|m) = |D_link(m)| / |D(m)|
  where D_link(m) is the number of Wikipedia articles in which m appears as a link,
  and D(m) is the number of Wikipedia articles that contain m
- Link probability:
  P(link|m) = link(m) / freq(m)
  where link(m) is the number of times mention m appears as a link,
  and freq(m) is the total number of times m occurs in Wikipedia (as a link or not)
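A minimal sketch of filtering by link probability, assuming precomputed per-mention counts; the dictionaries, the counts, and the 0.02 threshold below are illustrative only.

```python
# Sketch of mention filtering by link probability; counts and threshold are made up.
# link_count[m]: number of times m appears as the anchor text of a link in Wikipedia
# freq_count[m]: total number of times m occurs in Wikipedia (as a link or not)

def link_probability(mention: str, link_count: dict, freq_count: dict) -> float:
    freq = freq_count.get(mention, 0)
    return link_count.get(mention, 0) / freq if freq else 0.0

def filter_mentions(mentions, link_count, freq_count, threshold: float = 0.02):
    """Keep only mentions whose link probability exceeds the threshold."""
    return [m for m in mentions
            if link_probability(m, link_count, freq_count) > threshold]

# Toy usage: "home" is filtered out, the other mentions are kept
link_count = {"times square": 9500, "new york": 40000, "home": 120}
freq_count = {"times square": 11000, "new york": 150000, "home": 2000000}
print(filter_mentions(["times square", "new york", "home"], link_count, freq_count))
```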
Overlapping entity mentions
- Dealing with them in this phase

- E.g., by dropping a mention if it is subsumed by another mention
- Keeping them and postponing the decision to a later stage
(candidate selection or disambiguation)
Candidate selection
- Goal: Narrow down the space of disambiguation possibilities

- Balances between precision and recall (effectiveness vs.
efficiency)

- Often approached as a ranking problem

- Keeping only candidates above a score/rank threshold for downstream
processing
Commonness
- Perform the ranking of candidate entities based on their overall popularity, i.e., "most common sense"
  P(e|m) = n(m, e) / Σ_e' n(m, e')
  where n(m, e) is the number of times entity e is the link destination of mention m,
  and Σ_e' n(m, e') is the total number of times mention m appears as a link
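Commonness can be sketched directly from per-mention link-destination counts; the counts below are made up to roughly mirror the Times Square example that follows.

```python
# Sketch of commonness P(e|m) from per-mention link-destination counts n(m, e);
# the counts are illustrative, not real Wikipedia statistics.
from collections import Counter

def commonness(link_destination_counts: Counter) -> dict:
    """Normalize n(m, e) over all candidate entities of the mention into P(e|m)."""
    total = sum(link_destination_counts.values())
    return {e: n / total for e, n in link_destination_counts.items()}

n_times_square = Counter({
    "Times_Square": 9400,
    "Times_Square_(film)": 170,
    "Times_Square_(Hong_Kong)": 110,
    "Times_Square_(IRT_42nd_Street_Shuttle)": 60,
})
for entity, p in sorted(commonness(n_times_square).items(), key=lambda x: -x[1]):
    print(f"{entity}\t{p:.3f}")
```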
Example
Home to the Empire State Building, Times Square, Statue of Liberty and other iconic sites,
New York City is a fast-paced, globally influential center of art, culture, fashion and finance.
Entity (e) | Commonness P(e|m)
Times_Square | 0.940
Times_Square_(film) | 0.017
Times_Square_(Hong_Kong) | 0.011
Times_Square_(IRT_42nd_Street_Shuttle) | 0.006
… | …
Commonness
- Can be pre-computed and stored in the entity surface form
dictionary

- Follows a power law with a long tail of extremely unlikely
senses; entities at the tail end of the distribution can be safely
discarded

- E.g., 0.001 is a sensible threshold
Example
Entity Commonness
Germany 0.9417
Germany_national_football_team 0.0139
Nazi_Germany 0.0081
German_Empire 0.0065
...
Entity Commonness
FIFA_World_Cup 0.2358
FIS_Alpine_Ski_World_Cup 0.0682
2009_FINA_Swimming_World_Cup 0.0633
World_Cup_(men's_golf) 0.0622
...
Entity Commonness
1998_FIFA_World_Cup 0.9556
1998_IAAF_World_Cup 0.0296
1998_Alpine_Skiing_World_Cup 0.0059
...
Bulgaria's best World Cup performance was in the
1994 World Cup where they beat Germany, to reach
the semi-finals, losing to Italy, and finishing in
fourth ...
Disambiguation
- Baseline approach: most common sense

- Consider additional types of evidence

- Prior importance of entities and mentions
- Contextual similarity between the text surrounding the mention and
the candidate entity
- Coherence among all entity linking decisions in the document
- Combine these signals 

- Using supervised learning or graph-based approaches
- Optionally perform pruning 

- Reject low confidence or semantically meaningless annotations
Prior importance features
- Context-independent features

- Neither the text nor other mentions in the document are taken into
account
- Keyphraseness

- Link probability

- Commonness
Prior importance features
- Link prior
- Popularity of the entity measured in terms of incoming links
  P_link(e) = link(e) / Σ_e' link(e')
- Page views
- Popularity of the entity measured in terms of traffic volume
  P_pageviews(e) = pageviews(e) / Σ_e' pageviews(e')
Contextual features
- Compare the surrounding context of a mention with the
(textual) representation of the given candidate entity

- Context of a mention

- Window of text (sentence, paragraph) around the mention
- Entire document
- Entity's representation

- Wikipedia entity page, first description paragraph, terms with highest
TF-IDF score, etc.
- Entity's description in the knowledge base
Contextual similarity
- Commonly: bag-of-words representation
- Cosine similarity
  sim_cos(m, e) = (d_m · d_e) / (||d_m|| ||d_e||)
- Many other options for measuring similarity
- Dot product, KL divergence, Jaccard similarity
- Representation does not have to be limited to bag-of-words
- Concept vectors (named entities, Wikipedia categories, anchor text, keyphrases, etc.)
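A minimal sketch of the bag-of-words cosine similarity between a mention's context and an entity's textual description; whitespace tokenization and raw term counts stand in for proper preprocessing and TF-IDF weighting.

```python
# Sketch of bag-of-words cosine similarity between a mention context and an
# entity description (illustrative preprocessing; no TF-IDF weighting).
import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

mention_context = "they beat Germany to reach the semi-finals of the World Cup"
entity_description = "the 1994 FIFA World Cup was held in the United States"
print(cosine_similarity(mention_context, entity_description))
```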
Entity-relatedness features
- It can reasonably be assumed that a document focuses on
one or at most a few topics

- Therefore, entities mentioned in a document should be
topically related to each other

- Capturing topical coherence by developing some measure
of relatedness between (linked) entities

- Defined for pairs of entities
Wikipedia Link-based Measure (WLM)
- Often referred to simply as relatedness
- A close relationship is assumed between two entities if there is a large overlap between the entities linking to them
  WLM(e, e') = 1 - [log(max(|L_e|, |L_e'|)) - log(|L_e ∩ L_e'|)] / [log(|E|) - log(min(|L_e|, |L_e'|))]
  where L_e is the set of entities that link to e, and |E| is the total number of entities
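A small sketch of WLM computed over in-link sets, following the formula above; the in-link sets and the total entity count are made up.

```python
# Sketch of the Wikipedia Link-based Measure over in-link sets (illustrative data).
import math

def wlm(links_e1: set, links_e2: set, total_entities: int) -> float:
    """Relatedness of two entities from the overlap of the sets of entities linking to them."""
    overlap = len(links_e1 & links_e2)
    if overlap == 0:
        return 0.0
    numerator = math.log(max(len(links_e1), len(links_e2))) - math.log(overlap)
    denominator = math.log(total_entities) - math.log(min(len(links_e1), len(links_e2)))
    return 1.0 - numerator / denominator

links_automobile = {"Car", "Engine", "Transport", "Henry_Ford", "Fossil_fuel"}
links_global_warming = {"Carbon_dioxide", "Greenhouse_gas", "Fossil_fuel", "Transport"}
print(wlm(links_automobile, links_global_warming, total_entities=5_000_000))
```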
Wikipedia Link-based Measure (WLM)
[Figure: obtaining a semantic relatedness measure between Automobile and Global Warming from the overlap of their incoming and outgoing Wikipedia links.]
Image taken from Milne and Witten (2008a). An Effective, Low-Cost Measure of Semantic Relatedness Obtained from Wikipedia Links. In AAAI WikiAI Workshop.
Asymmetric relatedness features
- A relatedness function does not have to be symmetric

- E.g., the relatedness of the UNITED STATES given NEIL ARMSTRONG is
intuitively larger than the relatedness of NEIL ARMSTRONG given the
UNITED STATES
- Conditional probability:
  P(e'|e) = |L_e' ∩ L_e| / |L_e|
Entity-relatedness features
- Numerous ways to define relatedness

- Consider not only incoming, but also outgoing links or the union of
incoming and outgoing links
- Jaccard similarity, Pointwise Mutual Information (PMI), or the Chi-
square statistic, etc.
- Having a single relatedness function is preferred, to keep the
disambiguation process simple

- Various relatedness measures can effectively be combined
into a single score using a machine learning approach
[Ceccarelli et al., 2013]
Overview of features
Disambiguation approaches
- Consider local compatibility (including prior evidence) and coherence with the other entity linking decisions
- Task: find an assignment Γ : M_d → E ∪ {∅}
- Objective function:
  Γ* = arg max_Γ [ Σ_{(m,e) ∈ Γ} φ(m, e) + ψ(Γ) ]
  where φ(m, e) is the local compatibility between the mention and the assigned entity,
  and ψ(Γ) is the coherence function for all entity annotations in the document
- This optimization problem is NP-hard! Need to resort to approximation algorithms and heuristics
Disambiguation strategies
- Individually, one-mention-at-a-time
- Rank candidates for each mention, take the top ranked one (or NIL):
  Γ(m) = arg max_{e ∈ E_m} score(m, e)
- Interdependence between entity linking decisions may be incorporated in a pairwise fashion
- Collectively, all mentions in the document jointly
Disambiguation approaches
Approach | Context | Entity interdependence
Most common sense | none | none
Individual local disambiguation | text | none
Individual global disambiguation | text & entities | pairwise
Collective disambiguation | text & entities | collective
Individual local disambiguation
- Early entity linking approaches
- Local compatibility score can be written as a linear combination of features:
  φ(e, m) = Σ_i λ_i f_i(e, m)
  where the features f_i can be both context-independent and context-dependent
- Learn the "optimal" combination of features from training data using machine learning
Individual global disambiguation
- Consider what other entities are mentioned in the document

- True global optimization would be NP-hard

- Good approximation can be computed efficiently by
considering pairwise interdependencies for each mention
independently

- Pairwise entity relatedness scores need to be aggregated into a single
number (how coherent the given candidate entity is with the rest of the
entities in the document)
Approach #1: GLOW [Ratinov et al., 2011]
- First perform local disambiguation (E*)
- The coherence of entity e with all other linked entities in the document:
  g_j(e, E*) = ⊕_{e' ∈ E* \ {e}} f_j(e, e')
  where ⊕ is an aggregator function (max or avg) and f_j is a particular relatedness function
- Use the predictions in a second, global disambiguation round:
  score(m, e) = Σ_i λ_i f_i(e, m) + Σ_j λ_j g_j(e, E*)
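A minimal sketch of this two-round idea, assuming a first round has already produced the set E* of linked entities; `local_score` and `relatedness` are illustrative stand-ins for the learned feature combination and a WLM-style relatedness function.

```python
# Sketch of GLOW-style two-round scoring: a local score plus a coherence term
# aggregated (here: averaged) over the first-round entity assignments E*.
from statistics import mean

def coherence(entity: str, assigned: set, relatedness) -> float:
    """Average relatedness of `entity` to the other entities linked in round one."""
    others = [relatedness(entity, e2) for e2 in assigned if e2 != entity]
    return mean(others) if others else 0.0

def global_score(mention: str, entity: str, assigned: set,
                 local_score, relatedness,
                 w_local: float = 1.0, w_global: float = 1.0) -> float:
    return (w_local * local_score(mention, entity)
            + w_global * coherence(entity, assigned, relatedness))

# Toy usage: rescoring candidates for the mention "Jordan"
assigned = {"Chicago_Bulls", "Space_Jam"}
rel = lambda a, b: {("Michael_Jordan", "Chicago_Bulls"): 0.8,
                    ("Michael_Jordan", "Space_Jam"): 0.7}.get((a, b), 0.05)
loc = lambda m, e: 0.6 if e == "Michael_Jordan" else 0.4
for cand in ["Michael_Jordan", "Michael_I._Jordan"]:
    print(cand, global_score("Jordan", cand, assigned, loc, rel))
```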
Approach #2: TAGME [Ferragina & Scaiella, 2010]
- Combine the two most important features (commonness and relatedness) using a voting scheme
- The score of a candidate entity for a particular mention:
  score(m, e) = Σ_{m' ∈ M_d \ {m}} vote(m', e)
- The vote function estimates the agreement between e and all candidate entities of all other mentions in the document
TAGME (voting mechanism)
- Average relatedness between each possible disambiguation, weighted by its commonness score:
  vote(m', e) = Σ_{e' ∈ E_m'} WLM(e, e') · P(e'|m') / |E_m'|
[Figure: candidate entity e of mention m receives votes from the candidate entities e' of another mention m', each weighted by P(e'|m') and WLM(e, e').]
TAGME (final score)
- Final decision uses a simple but robust heuristic
- The top entities with the highest score are considered for a given mention and the one with the highest commonness score is selected:
  Γ(m) = arg max_{e ∈ E_m} { P(e|m) : e ∈ top_ε[score(m, e)] }
  with score(m, e) = Σ_{m' ∈ M_d \ {m}} vote(m', e)
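A compact sketch of TAGME-style scoring (vote, score, and the final selection); `candidates`, `commonness`, and `relatedness` are illustrative names, and the top-ε selection from the slide is simplified here to a fixed top-k cut.

```python
# Sketch of TAGME-style voting and final selection. `candidates` maps each
# mention to its candidate entities, `commonness[m][e]` is P(e|m), and
# `relatedness(e, e2)` plays the role of WLM; names and the top-k cut are illustrative.
def vote(m_other: str, e: str, candidates: dict, commonness: dict, relatedness) -> float:
    cands = candidates[m_other]
    return sum(relatedness(e, e2) * commonness[m_other][e2] for e2 in cands) / len(cands)

def tagme_score(m: str, e: str, candidates: dict, commonness: dict, relatedness) -> float:
    return sum(vote(m2, e, candidates, commonness, relatedness)
               for m2 in candidates if m2 != m)

def tagme_disambiguate(m: str, candidates: dict, commonness: dict, relatedness,
                       top_k: int = 3) -> str:
    """Among the top-scored candidates, pick the one with the highest commonness."""
    ranked = sorted(candidates[m],
                    key=lambda e: tagme_score(m, e, candidates, commonness, relatedness),
                    reverse=True)
    return max(ranked[:top_k], key=lambda e: commonness[m][e])
```

In practice the cut-off would be tuned rather than fixed to three candidates.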
Collective disambiguation
- Graph-based representation

- Mention-entity edges capture the local compatibility
between the mention and the entity

- Measured using a combination of context-independent and context-
dependent features
- Entity-entity edges represent the semantic relatedness
between a pair of entities

- Common choice is relatedness (WLM)
- Use these relations jointly to identify a single referent entity
(or none) for each of the mentions
Example
[Figure 2 (Referent Graph): mentions Space Jam, Bulls, and Jordan connected to candidate entities Space Jam, Chicago Bulls, Bull, Michael Jordan, Michael I. Jordan, and Michael B. Jordan; each mention-entity edge represents a Compatible relation and each entity-entity edge a Semantic-Related relation, both carrying weights.]
AIDA [Hoffart et al., 2011]
- Problem formulation: find a dense subgraph that contains all
mention nodes and exactly one mention-entity edge for each
mention

- Greedy algorithm iteratively removes edges
Algorithm
- Start with the full graph 

- Iteratively remove the entity node with the lowest weighted
degree (along with all its incident edges), provided that each
mention node remains connected to at least one entity

- Weighted degree of an entity node is the sum of the weights of its
incident edges
- The graph with the highest density is kept as the solution

- The density of the graph is measured as the minimum weighted degree
among its entity nodes
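A minimal sketch of this greedy procedure on a weighted graph stored as adjacency dictionaries; node identifiers and the data layout are assumptions for illustration.

```python
# Sketch of the greedy dense-subgraph heuristic described above. The graph is
# stored as `edges[node] = {neighbor: weight}` (symmetric); illustrative layout.
import copy

def weighted_degree(node, edges):
    return sum(edges[node].values())

def greedy_dense_subgraph(mentions, entities, edges):
    edges = copy.deepcopy(edges)
    entities = set(entities)
    best_density, best_entities = -1.0, set(entities)
    while entities:
        # density = minimum weighted degree among the remaining entity nodes
        density = min(weighted_degree(e, edges) for e in entities)
        if density > best_density:
            best_density, best_entities = density, set(entities)
        # an entity may be removed only if every mention keeps >= 1 entity neighbor
        removable = [e for e in entities
                     if all(sum(1 for n in edges[m] if n in entities and n != e) >= 1
                            for m in mentions if e in edges[m])]
        if not removable:
            break
        victim = min(removable, key=lambda e: weighted_degree(e, edges))
        entities.discard(victim)
        for n in list(edges[victim]):
            edges[n].pop(victim, None)
        edges.pop(victim)
    return best_entities, best_density
```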
Example iteration #1
[Referent Graph from the previous example, annotated with the weighted degree of each entity node: 0.95, 0.86, 0.03, 1.56, 0.12.]
Which entity should be removed?
What is the density of the graph? 0.03
Example iteration #2
[Same graph with the entity node removed in iteration #1 (and its incident edges) crossed out.]
Which entity should be removed?
What is the density of the graph? 0.12
Example iteration #3
[Same graph after a further entity node (and its incident edges) has been removed.]
Which entity should be removed?
What is the density of the graph? 0.86
Pre- and post-processing
- Pre-processing phase: remove entities that are "too distant"
from the mention nodes

- At the end of the iterations, the solution graph may still
contain mentions that are connected to more than one entity;
deal with this in post-processing

- If the graph is sufficiently small, it is feasible to exhaustively consider all
possible mention-entity pairs
- Otherwise, a faster local (hill-climbing) search algorithm may be used
Random walk with restarts [Han et al., 2011]
- "A random walk with restart is a stochastic process to traverse a graph, resulting in a probability distribution over the vertices corresponding to the likelihood those vertices are visited" [Guo et al., 2014]
- Starting vector holding the initial evidence:
  v(m) = tfidf(m) / Σ_{m' ∈ M_d} tfidf(m')
Random walk process
- T is the evidence propagation matrix
- Evidence propagation ratio:
  T(m → e) = w(m → e) / Σ_{e' ∈ E_m} w(m → e')
  T(e → e') = w(e → e') / Σ_{e'' ∈ E_d} w(e → e'')
- Evidence propagation:
  r_0 = v
  r_{t+1} = (1 - α) · r_t · T + α · v
  where r_t is a vector holding the probability that the nodes are visited, α is the restart probability (0.1), and v is the initial evidence
Final decision
- Once the random walk process has converged to a stationary distribution, the referent entity for each mention is selected as:
  Γ(m) = arg max_{e ∈ E_m} φ(m, e) · r(e)
  where φ(e, m) = Σ_i λ_i f_i(e, m) is the local compatibility
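A small numpy sketch of the random walk with restarts; the tiny transition matrix, the node layout, and the toy numbers are made up for illustration (the restart probability of 0.1 follows the slide).

```python
# Sketch of random walk with restarts over the mention-entity graph; the
# row-stochastic transition matrix and node layout are illustrative.
import numpy as np

def random_walk_with_restarts(T: np.ndarray, v: np.ndarray,
                              alpha: float = 0.1, iters: int = 100) -> np.ndarray:
    """Iterate r_{t+1} = (1 - alpha) * r_t @ T + alpha * v."""
    r = v.copy()
    for _ in range(iters):
        r = (1 - alpha) * (r @ T) + alpha * v
    return r

# Nodes: [mention "Bulls", entity Chicago_Bulls, entity Bull]
T = np.array([
    [0.0, 0.8, 0.2],   # mention -> its candidate entities
    [0.5, 0.0, 0.5],   # entity -> mention / related entity
    [0.5, 0.5, 0.0],
])
v = np.array([1.0, 0.0, 0.0])  # initial evidence concentrated on the mention
r = random_walk_with_restarts(T, v)
print(r)  # visiting probabilities; combined with local compatibility for the final decision
```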
Pruning
- Discarding meaningless or low-confidence annotations
produced by the disambiguation phase

- Simplest solution: use a confidence threshold

- More advanced solutions

- Machine learned classifier to retain only entities that are "relevant enough" (a human editor would annotate them) [Milne & Witten, 2008]
- Optimization problem: decide, for each mention, whether switching the
top ranked disambiguation to NIL would improve the objective function
[Ratinov et al., 2011]
Entity linking systems
Publicly available entity linking systems
System | Reference KB | Online demo | Web API | Source code
AIDA (http://www.mpi-inf.mpg.de/yago-naga/aida/) | YAGO2 | Yes | Yes | Yes (Java)
DBpedia Spotlight (http://spotlight.dbpedia.org/) | DBpedia | Yes | Yes | Yes (Java)
Illinois Wikifier (http://cogcomp.cs.illinois.edu/page/download_view/Wikifier) | Wikipedia | No | No | Yes (Java)
TAGME (https://tagme.d4science.org/tagme/) | Wikipedia | Yes | Yes | Yes (Java)
Wikipedia Miner (https://github.com/dnmilne/wikipediaminer) | Wikipedia | No | No | Yes (Java)
Publicly available entity linking systems
- AIDA [Hoffart et al., 2011]
- Collective disambiguation using a graph-based approach (see under
Disambiguation section)
- DBpedia Spotlight [Mendes et al., 2011]

- Straightforward local disambiguation approach
- Vector space representations of entities are compared against the paragraphs of
their mentions using the cosine similarity;
- Introduce Inverse Candidate Frequency (ICF) instead of using IDF
- Annotate with DBpedia entities, which can be restricted to certain
types or even to a custom entity set defined by a SPARQL query
Publicly available entity linking systems
- Illinois Wikifier (a.k.a. GLOW) [Ratinov et al., 2011]

- Individual global disambiguation (see under Disambiguation section)
- TAGME [Ferragina & Scaiella, 2010]

- One of the most popular entity linking systems
- Designed specifically for efficient annotation of short texts, but works
well on long texts too
- Individual global disambiguation (see under Disambiguation section)
- Wikipedia Miner [Milne & Witten, 2008]

- Seminal entity linking system that was the first to ...
- combine commonness and relatedness for (local) disambiguation
- have an open-source implementation and be offered as a web service
Evaluation
Evaluation (end-to-end)
- Comparing the system-generated annotations against a
human-annotated gold standard

- Evaluation criteria

- Perfect match: both the linked entity and the mention offsets must
match
- Relaxed match: the linked entity must match; it is sufficient if the mention overlaps with the gold standard
Evaluation with relaxed match
[Examples #1 and #2: ground truth vs. system annotation spans.]
Evaluation metrics
- Set-based metrics:

- Precision: fraction of the entities annotated by the system that are correctly linked
- Recall: fraction of the entities that should be annotated that are correctly linked
- F-measure: harmonic mean of precision and recall
- Metrics are computed over a collection of documents

- Micro-averaged: aggregated across mentions
- Macro-averaged: aggregated across documents
Evaluation metrics
- Micro-averaged:
  P_mic = |A_D ∩ Â_D| / |A_D|
  R_mic = |A_D ∩ Â_D| / |Â_D|
  where A_D is the set of annotations generated by the entity linking system and Â_D is the set of ground truth annotations (over the entire collection)
- Macro-averaged:
  P_mac = ( Σ_{d ∈ D} |A_d ∩ Â_d| / |A_d| ) / |D|
  R_mac = ( Σ_{d ∈ D} |A_d ∩ Â_d| / |Â_d| ) / |D|
- F1 score:
  F1 = 2 · P · R / (P + R)
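A minimal sketch of computing these metrics from per-document annotation sets; representing each annotation as a (mention, entity) pair is an illustrative simplification.

```python
# Sketch of micro- and macro-averaged precision/recall/F1 over per-document
# annotation sets; annotations are illustrated as (mention, entity) pairs.
def evaluate(system: dict, gold: dict):
    """system, gold: document id -> set of annotations (same document ids in both)."""
    docs = list(gold)
    tp = sum(len(system[d] & gold[d]) for d in docs)
    sys_total = sum(len(system[d]) for d in docs)
    gold_total = sum(len(gold[d]) for d in docs)
    f1 = lambda p, r: 2 * p * r / (p + r) if p + r else 0.0
    p_mic = tp / sys_total if sys_total else 0.0
    r_mic = tp / gold_total if gold_total else 0.0
    p_mac = sum((len(system[d] & gold[d]) / len(system[d])) if system[d] else 0.0
                for d in docs) / len(docs)
    r_mac = sum((len(system[d] & gold[d]) / len(gold[d])) if gold[d] else 0.0
                for d in docs) / len(docs)
    return {"micro": (p_mic, r_mic, f1(p_mic, r_mic)),
            "macro": (p_mac, r_mac, f1(p_mac, r_mac))}

gold = {"d1": {("Jordan", "Michael_Jordan")},
        "d2": {("Bulls", "Chicago_Bulls"), ("Space Jam", "Space_Jam")}}
system = {"d1": {("Jordan", "Michael_I._Jordan")},
          "d2": {("Bulls", "Chicago_Bulls"), ("Space Jam", "Space_Jam")}}
print(evaluate(system, gold))
```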
Test collections by individual researchers
Collection | Reference KB | Documents | #Docs | Annots. | #Mentions | All? | NIL?
MSNBC [Cucerzan, 2007] | Wikipedia | News | 20 | entities | 755 | Yes | Yes
AQUAINT [Milne & Witten, 2008] | Wikipedia | News | 50 | entities & concepts | 727 | No | No
IITB [Kulkarni et al., 2009] | Wikipedia | Web pages | 107 | entities | 17,200 | Yes | Yes
ACE2004 [Ratinov et al., 2011] | Wikipedia | News | 57 | entities | 306 | Yes | Yes
CoNLL-YAGO [Hoffart et al., 2011] | YAGO2 | News | 1,393 | entities | 34,956 | Yes | Yes
Evaluation campaigns
- INEX Link-the-Wiki

- TAC Entity Linking

- Entity Recognition and Disambiguation (ERD) Challenge
Component-based evaluation
- The pipeline architecture makes the evaluation of entity
linking systems especially challenging

- The main focus is on the disambiguation component, but its
performance is largely influenced by the preceding steps
- Fair comparison between two approaches can only be made
if they share all other elements of the pipeline
Meta evaluations
- [Hachey et al., 2009]
- Reimplements three state-of-the-art systems
- DEXTER [Ceccarelli et al., 2013]
- Reimplements TAGME, Wikipedia Miner, and collective EL [Han et al., 2011]
- BAT-Framework [Cornolti et al., 2013]
- Compares publicly available entity annotation systems (AIDA, Illinois Wikifier, TAGME, Wikipedia Miner, DBpedia Spotlight)
- A number of test collections (news, web pages, and tweets)
- GERBIL [Usbeck et al., 2015]
- Extends the BAT-Framework by being able to link to any knowledge base
- Benchmarks any annotation service through a REST interface
- Provides persistent URLs for experimental settings (for reproducibility)
Resources
A Cross-Lingual Dictionary for English Wikipedia Concepts
- Collecting strings (mentions) that link to Wikipedia articles on the Web
- https://research.googleblog.com/2012/05/from-words-to-concepts-and-back.html
- Each dictionary entry gives a string (mention), a Wikipedia article, and the number of times the mention is linked to that article
Freebase Annotations of the ClueWeb Corpora
- ClueWeb annotated with Freebase entities (by Google)
- http://lemurproject.org/clueweb09/FACC1/
- http://lemurproject.org/clueweb12/FACC1/
- Each annotation gives the name of the document that was annotated, the entity mention, the beginning and end byte offsets, the entity ID in Freebase, the confidence given both the mention and the context, and the confidence given just the context (ignoring the mention)
Entity linking in queries
Entity linking in queries
- Challenges

- search queries are short
- limited context
- lack of proper grammar, spelling
- multiple interpretations
- needs to be fast
Example
the governator
movie
person
Example
the governator movie
Example
new york pizza manhattan
ERD’14 challenge
- Task: finding query interpretations

- Input: keyword query

- Output: sets of sets of entities

- Reference KB: Freebase

- Annotations are to be performed by a web service within a
given time limit
Evaluation
P = |I ∩ Î| / |I|
R = |I ∩ Î| / |Î|
F = 2 · P · R / (P + R)
where Î is the set of ground truth interpretations and I is the set of interpretations returned by the system
Example: for the query "new york pizza manhattan", interpretations are entity sets such as {New York-style pizza, Manhattan}, {New York City, Manhattan}, and {New York-style pizza}; ground truth and system annotations are compared as sets of such interpretations.
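A small sketch of this set-of-interpretations evaluation; the interpretation sets in the usage example are made up, loosely based on the slide's example query.

```python
# Sketch of ERD-style evaluation, where ground truth and system output are sets
# of interpretations and each interpretation is itself a set of entities.
def erd_prf(system_interps, gold_interps):
    system = {frozenset(i) for i in system_interps}
    gold = {frozenset(i) for i in gold_interps}
    correct = len(system & gold)
    p = correct / len(system) if system else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Illustrative interpretation sets (not the official ERD'14 ground truth)
gold = [{"New York-style pizza", "Manhattan"}, {"New York City", "Manhattan"}]
system = [{"New York City", "Manhattan"}]
print(erd_prf(system, gold))  # P = 1.0, R = 0.5, F ≈ 0.67
```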
ERD'14 results
Rank | Team | F1 | latency
1 | SMAPH Team | 0.7076 | 0.49
2 | NTUNLP | 0.6797 | 1.04
3 | Seznam Research | 0.6693 | 3.91
(A single interpretation is returned.)
http://web-ngram.research.microsoft.com/erd2014/LeaderBoard.aspx
Hands-on part

http://bit.ly/russir2016-elx
Questions?
Contact | @krisztianbalog | krisztianbalog.com

More Related Content

What's hot

Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)Alia Hamwi
 
Skillshare - Introduction to Data Scraping
Skillshare - Introduction to Data ScrapingSkillshare - Introduction to Data Scraping
Skillshare - Introduction to Data ScrapingSchool of Data
 
Natural Language processing
Natural Language processingNatural Language processing
Natural Language processingSanzid Kawsar
 
LLMs in Production: Tooling, Process, and Team Structure
LLMs in Production: Tooling, Process, and Team StructureLLMs in Production: Tooling, Process, and Team Structure
LLMs in Production: Tooling, Process, and Team StructureAggregage
 
Introduction of Knowledge Graphs
Introduction of Knowledge GraphsIntroduction of Knowledge Graphs
Introduction of Knowledge GraphsJeff Z. Pan
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information RetrievalBhaskar Mitra
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)Yuriy Guts
 
Information Retrieval-4(inverted index_&amp;_query handling)
Information Retrieval-4(inverted index_&amp;_query handling)Information Retrieval-4(inverted index_&amp;_query handling)
Information Retrieval-4(inverted index_&amp;_query handling)Jeet Das
 
A Simple Introduction to Word Embeddings
A Simple Introduction to Word EmbeddingsA Simple Introduction to Word Embeddings
A Simple Introduction to Word EmbeddingsBhaskar Mitra
 
And then there were ... Large Language Models
And then there were ... Large Language ModelsAnd then there were ... Large Language Models
And then there were ... Large Language ModelsLeon Dohmen
 
Ontologies and semantic web
Ontologies and semantic webOntologies and semantic web
Ontologies and semantic webStanley Wang
 
Information retrieval-systems notes
Information retrieval-systems notesInformation retrieval-systems notes
Information retrieval-systems notesBAIRAVI T
 
Artificial Intelligence Notes Unit 1
Artificial Intelligence Notes Unit 1 Artificial Intelligence Notes Unit 1
Artificial Intelligence Notes Unit 1 DigiGurukul
 

What's hot (20)

Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)
 
Skillshare - Introduction to Data Scraping
Skillshare - Introduction to Data ScrapingSkillshare - Introduction to Data Scraping
Skillshare - Introduction to Data Scraping
 
Natural Language processing
Natural Language processingNatural Language processing
Natural Language processing
 
LLMs in Production: Tooling, Process, and Team Structure
LLMs in Production: Tooling, Process, and Team StructureLLMs in Production: Tooling, Process, and Team Structure
LLMs in Production: Tooling, Process, and Team Structure
 
Rule Based System
Rule Based SystemRule Based System
Rule Based System
 
Introduction of Knowledge Graphs
Introduction of Knowledge GraphsIntroduction of Knowledge Graphs
Introduction of Knowledge Graphs
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information Retrieval
 
Markov Random Field (MRF)
Markov Random Field (MRF)Markov Random Field (MRF)
Markov Random Field (MRF)
 
Knowledge representation
Knowledge representationKnowledge representation
Knowledge representation
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
 
Information Retrieval-4(inverted index_&amp;_query handling)
Information Retrieval-4(inverted index_&amp;_query handling)Information Retrieval-4(inverted index_&amp;_query handling)
Information Retrieval-4(inverted index_&amp;_query handling)
 
A Simple Introduction to Word Embeddings
A Simple Introduction to Word EmbeddingsA Simple Introduction to Word Embeddings
A Simple Introduction to Word Embeddings
 
AI: Logic in AI
AI: Logic in AIAI: Logic in AI
AI: Logic in AI
 
Intro to LLMs
Intro to LLMsIntro to LLMs
Intro to LLMs
 
And then there were ... Large Language Models
And then there were ... Large Language ModelsAnd then there were ... Large Language Models
And then there were ... Large Language Models
 
Ontologies and semantic web
Ontologies and semantic webOntologies and semantic web
Ontologies and semantic web
 
Information retrieval-systems notes
Information retrieval-systems notesInformation retrieval-systems notes
Information retrieval-systems notes
 
Tabu search
Tabu searchTabu search
Tabu search
 
Temporal data mining
Temporal data miningTemporal data mining
Temporal data mining
 
Artificial Intelligence Notes Unit 1
Artificial Intelligence Notes Unit 1 Artificial Intelligence Notes Unit 1
Artificial Intelligence Notes Unit 1
 

Similar to Entity Linking at the 10th Russian Summer School in Information Retrieval (RuSSIR 2016

Entity linking with a knowledge base issues,
Entity linking with a knowledge base issues,Entity linking with a knowledge base issues,
Entity linking with a knowledge base issues,Nexgen Technology
 
Information retrieval and extraction
Information retrieval and extractionInformation retrieval and extraction
Information retrieval and extractionAnkit Sharma
 
FRSAD Functional Requirements for Subject Authority Data model
FRSAD Functional Requirements for Subject Authority Data modelFRSAD Functional Requirements for Subject Authority Data model
FRSAD Functional Requirements for Subject Authority Data modelMarcia Zeng
 
Repositories thru the looking glass
Repositories thru the looking glassRepositories thru the looking glass
Repositories thru the looking glassEduserv Foundation
 
Journalism and the Semantic Web
Journalism and the Semantic WebJournalism and the Semantic Web
Journalism and the Semantic WebKurt Cagle
 
Entity Retrieval (SIGIR 2013 tutorial)
Entity Retrieval (SIGIR 2013 tutorial)Entity Retrieval (SIGIR 2013 tutorial)
Entity Retrieval (SIGIR 2013 tutorial)krisztianbalog
 
Information_retrieval_and_extraction_IIIT
Information_retrieval_and_extraction_IIITInformation_retrieval_and_extraction_IIIT
Information_retrieval_and_extraction_IIITAnkit Sharma
 
Download
DownloadDownload
Downloadbutest
 
Download
DownloadDownload
Downloadbutest
 
A Dublin Core Application Profile for Scholarly Works (eprints)
A Dublin Core Application Profile for Scholarly Works (eprints)A Dublin Core Application Profile for Scholarly Works (eprints)
A Dublin Core Application Profile for Scholarly Works (eprints)Julie Allinson
 
bridging formal semantics and social semantics on the web
bridging formal semantics and social semantics on the webbridging formal semantics and social semantics on the web
bridging formal semantics and social semantics on the webFabien Gandon
 
Content Repositories vs Knowledge Bases
Content Repositories vs Knowledge BasesContent Repositories vs Knowledge Bases
Content Repositories vs Knowledge Basesgokcebanu
 
Text Data Mining
Text Data MiningText Data Mining
Text Data MiningKU Leuven
 
Entity Retrieval (WWW 2013 tutorial)
Entity Retrieval (WWW 2013 tutorial)Entity Retrieval (WWW 2013 tutorial)
Entity Retrieval (WWW 2013 tutorial)krisztianbalog
 
Evaluation Initiatives for Entity-oriented Search
Evaluation Initiatives for Entity-oriented SearchEvaluation Initiatives for Entity-oriented Search
Evaluation Initiatives for Entity-oriented Searchkrisztianbalog
 
Eprints Special Session - DC-2006, Mexico
Eprints Special Session - DC-2006, MexicoEprints Special Session - DC-2006, Mexico
Eprints Special Session - DC-2006, MexicoEduserv Foundation
 
Relations for Reusing (R4R) in A Shared Context: An Exploration on Research P...
Relations for Reusing (R4R) in A Shared Context: An Exploration on Research P...Relations for Reusing (R4R) in A Shared Context: An Exploration on Research P...
Relations for Reusing (R4R) in A Shared Context: An Exploration on Research P...andrea huang
 
The Eprints Application Profile: a FRBR approach to modelling repository meta...
The Eprints Application Profile: a FRBR approach to modelling repository meta...The Eprints Application Profile: a FRBR approach to modelling repository meta...
The Eprints Application Profile: a FRBR approach to modelling repository meta...Julie Allinson
 

Similar to Entity Linking at the 10th Russian Summer School in Information Retrieval (RuSSIR 2016 (20)

Entity linking with a knowledge base issues,
Entity linking with a knowledge base issues,Entity linking with a knowledge base issues,
Entity linking with a knowledge base issues,
 
Information retrieval and extraction
Information retrieval and extractionInformation retrieval and extraction
Information retrieval and extraction
 
FRSAD Functional Requirements for Subject Authority Data model
FRSAD Functional Requirements for Subject Authority Data modelFRSAD Functional Requirements for Subject Authority Data model
FRSAD Functional Requirements for Subject Authority Data model
 
Repositories thru the looking glass
Repositories thru the looking glassRepositories thru the looking glass
Repositories thru the looking glass
 
Journalism and the Semantic Web
Journalism and the Semantic WebJournalism and the Semantic Web
Journalism and the Semantic Web
 
Entity Retrieval (SIGIR 2013 tutorial)
Entity Retrieval (SIGIR 2013 tutorial)Entity Retrieval (SIGIR 2013 tutorial)
Entity Retrieval (SIGIR 2013 tutorial)
 
Information_retrieval_and_extraction_IIIT
Information_retrieval_and_extraction_IIITInformation_retrieval_and_extraction_IIIT
Information_retrieval_and_extraction_IIIT
 
Download
DownloadDownload
Download
 
Download
DownloadDownload
Download
 
A Dublin Core Application Profile for Scholarly Works (eprints)
A Dublin Core Application Profile for Scholarly Works (eprints)A Dublin Core Application Profile for Scholarly Works (eprints)
A Dublin Core Application Profile for Scholarly Works (eprints)
 
bridging formal semantics and social semantics on the web
bridging formal semantics and social semantics on the webbridging formal semantics and social semantics on the web
bridging formal semantics and social semantics on the web
 
Content Repositories vs Knowledge Bases
Content Repositories vs Knowledge BasesContent Repositories vs Knowledge Bases
Content Repositories vs Knowledge Bases
 
Text Data Mining
Text Data MiningText Data Mining
Text Data Mining
 
Entity Retrieval (WWW 2013 tutorial)
Entity Retrieval (WWW 2013 tutorial)Entity Retrieval (WWW 2013 tutorial)
Entity Retrieval (WWW 2013 tutorial)
 
Eprints Application Profile
Eprints Application ProfileEprints Application Profile
Eprints Application Profile
 
Evaluation Initiatives for Entity-oriented Search
Evaluation Initiatives for Entity-oriented SearchEvaluation Initiatives for Entity-oriented Search
Evaluation Initiatives for Entity-oriented Search
 
Eprints Special Session - DC-2006, Mexico
Eprints Special Session - DC-2006, MexicoEprints Special Session - DC-2006, Mexico
Eprints Special Session - DC-2006, Mexico
 
Data modeling
Data modelingData modeling
Data modeling
 
Relations for Reusing (R4R) in A Shared Context: An Exploration on Research P...
Relations for Reusing (R4R) in A Shared Context: An Exploration on Research P...Relations for Reusing (R4R) in A Shared Context: An Exploration on Research P...
Relations for Reusing (R4R) in A Shared Context: An Exploration on Research P...
 
The Eprints Application Profile: a FRBR approach to modelling repository meta...
The Eprints Application Profile: a FRBR approach to modelling repository meta...The Eprints Application Profile: a FRBR approach to modelling repository meta...
The Eprints Application Profile: a FRBR approach to modelling repository meta...
 

More from krisztianbalog

Towards Filling the Gap in Conversational Search: From Passage Retrieval to C...
Towards Filling the Gap in Conversational Search: From Passage Retrieval to C...Towards Filling the Gap in Conversational Search: From Passage Retrieval to C...
Towards Filling the Gap in Conversational Search: From Passage Retrieval to C...krisztianbalog
 
Conversational AI from an Information Retrieval Perspective: Remaining Challe...
Conversational AI from an Information Retrieval Perspective: Remaining Challe...Conversational AI from an Information Retrieval Perspective: Remaining Challe...
Conversational AI from an Information Retrieval Perspective: Remaining Challe...krisztianbalog
 
What Does Conversational Information Access Exactly Mean and How to Evaluate It?
What Does Conversational Information Access Exactly Mean and How to Evaluate It?What Does Conversational Information Access Exactly Mean and How to Evaluate It?
What Does Conversational Information Access Exactly Mean and How to Evaluate It?krisztianbalog
 
Personal Knowledge Graphs
Personal Knowledge GraphsPersonal Knowledge Graphs
Personal Knowledge Graphskrisztianbalog
 
Entities for Augmented Intelligence
Entities for Augmented IntelligenceEntities for Augmented Intelligence
Entities for Augmented Intelligencekrisztianbalog
 
On Entities and Evaluation
On Entities and EvaluationOn Entities and Evaluation
On Entities and Evaluationkrisztianbalog
 
Table Retrieval and Generation
Table Retrieval and GenerationTable Retrieval and Generation
Table Retrieval and Generationkrisztianbalog
 
Entity Search: The Last Decade and the Next
Entity Search: The Last Decade and the NextEntity Search: The Last Decade and the Next
Entity Search: The Last Decade and the Nextkrisztianbalog
 
Overview of the TREC 2016 Open Search track: Academic Search Edition
Overview of the TREC 2016 Open Search track: Academic Search EditionOverview of the TREC 2016 Open Search track: Academic Search Edition
Overview of the TREC 2016 Open Search track: Academic Search Editionkrisztianbalog
 
Overview of the Living Labs for IR Evaluation (LL4IR) CLEF Lab
Overview of the Living Labs for IR Evaluation (LL4IR) CLEF LabOverview of the Living Labs for IR Evaluation (LL4IR) CLEF Lab
Overview of the Living Labs for IR Evaluation (LL4IR) CLEF Labkrisztianbalog
 
Entity Retrieval (tutorial organized by Radialpoint in Montreal)
Entity Retrieval (tutorial organized by Radialpoint in Montreal)Entity Retrieval (tutorial organized by Radialpoint in Montreal)
Entity Retrieval (tutorial organized by Radialpoint in Montreal)krisztianbalog
 
Entity Retrieval (WSDM 2014 tutorial)
Entity Retrieval (WSDM 2014 tutorial)Entity Retrieval (WSDM 2014 tutorial)
Entity Retrieval (WSDM 2014 tutorial)krisztianbalog
 
Time-aware Evaluation of Cumulative Citation Recommendation Systems
Time-aware Evaluation of Cumulative Citation Recommendation SystemsTime-aware Evaluation of Cumulative Citation Recommendation Systems
Time-aware Evaluation of Cumulative Citation Recommendation Systemskrisztianbalog
 
Multi-step Classification Approaches to Cumulative Citation Recommendation
Multi-step Classification Approaches to Cumulative Citation RecommendationMulti-step Classification Approaches to Cumulative Citation Recommendation
Multi-step Classification Approaches to Cumulative Citation Recommendationkrisztianbalog
 
Semistructured Data Seach
Semistructured Data SeachSemistructured Data Seach
Semistructured Data Seachkrisztianbalog
 
Collection Ranking and Selection for Federated Entity Search
Collection Ranking and Selection for Federated Entity SearchCollection Ranking and Selection for Federated Entity Search
Collection Ranking and Selection for Federated Entity Searchkrisztianbalog
 

More from krisztianbalog (16)

Towards Filling the Gap in Conversational Search: From Passage Retrieval to C...
Towards Filling the Gap in Conversational Search: From Passage Retrieval to C...Towards Filling the Gap in Conversational Search: From Passage Retrieval to C...
Towards Filling the Gap in Conversational Search: From Passage Retrieval to C...
 
Conversational AI from an Information Retrieval Perspective: Remaining Challe...
Conversational AI from an Information Retrieval Perspective: Remaining Challe...Conversational AI from an Information Retrieval Perspective: Remaining Challe...
Conversational AI from an Information Retrieval Perspective: Remaining Challe...
 
What Does Conversational Information Access Exactly Mean and How to Evaluate It?
What Does Conversational Information Access Exactly Mean and How to Evaluate It?What Does Conversational Information Access Exactly Mean and How to Evaluate It?
What Does Conversational Information Access Exactly Mean and How to Evaluate It?
 
Personal Knowledge Graphs
Personal Knowledge GraphsPersonal Knowledge Graphs
Personal Knowledge Graphs
 
Entities for Augmented Intelligence
Entities for Augmented IntelligenceEntities for Augmented Intelligence
Entities for Augmented Intelligence
 
On Entities and Evaluation
On Entities and EvaluationOn Entities and Evaluation
On Entities and Evaluation
 
Table Retrieval and Generation
Table Retrieval and GenerationTable Retrieval and Generation
Table Retrieval and Generation
 
Entity Search: The Last Decade and the Next
Entity Search: The Last Decade and the NextEntity Search: The Last Decade and the Next
Entity Search: The Last Decade and the Next
 
Overview of the TREC 2016 Open Search track: Academic Search Edition
Overview of the TREC 2016 Open Search track: Academic Search EditionOverview of the TREC 2016 Open Search track: Academic Search Edition
Overview of the TREC 2016 Open Search track: Academic Search Edition
 
Overview of the Living Labs for IR Evaluation (LL4IR) CLEF Lab
Overview of the Living Labs for IR Evaluation (LL4IR) CLEF LabOverview of the Living Labs for IR Evaluation (LL4IR) CLEF Lab
Overview of the Living Labs for IR Evaluation (LL4IR) CLEF Lab
 
Entity Retrieval (tutorial organized by Radialpoint in Montreal)
Entity Retrieval (tutorial organized by Radialpoint in Montreal)Entity Retrieval (tutorial organized by Radialpoint in Montreal)
Entity Retrieval (tutorial organized by Radialpoint in Montreal)
 
Entity Retrieval (WSDM 2014 tutorial)
Entity Retrieval (WSDM 2014 tutorial)Entity Retrieval (WSDM 2014 tutorial)
Entity Retrieval (WSDM 2014 tutorial)
 
Time-aware Evaluation of Cumulative Citation Recommendation Systems
Time-aware Evaluation of Cumulative Citation Recommendation SystemsTime-aware Evaluation of Cumulative Citation Recommendation Systems
Time-aware Evaluation of Cumulative Citation Recommendation Systems
 
Multi-step Classification Approaches to Cumulative Citation Recommendation
Multi-step Classification Approaches to Cumulative Citation RecommendationMulti-step Classification Approaches to Cumulative Citation Recommendation
Multi-step Classification Approaches to Cumulative Citation Recommendation
 
Semistructured Data Seach
Semistructured Data SeachSemistructured Data Seach
Semistructured Data Seach
 
Collection Ranking and Selection for Federated Entity Search
Collection Ranking and Selection for Federated Entity SearchCollection Ranking and Selection for Federated Entity Search
Collection Ranking and Selection for Federated Entity Search
 

Recently uploaded

Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????blackmambaettijean
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 

Recently uploaded (20)

Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 

Entity Linking at the 10th Russian Summer School in Information Retrieval (RuSSIR 2016

  • 1. Entity Linking 10th Russian Summer School in Information Retrieval (RuSSIR 2016) | Saratov, Russia, 2016 Krisztian Balog University of Stavanger
 @krisztianbalog
  • 2. From Named Entity Recognition to Entity Linking
  • 3. Named entity recognition (NER) - Also known as entity identification, entity extraction, and entity chunking - Task: identifying named entities in text and labeling them with one of the possible entity types - Person (PER), organization (ORG), location (LOC), miscellaneous (MISC). Sometimes also temporal expressions (TIMEX) and certain types of numerical expressions (NUMEX) <LOC>Silicon Valley</LOC> venture capitalist <PER>Michael Moritz</PER> said that today's billion-dollar "unicorn" startups can learn from <ORG>Apple</ORG> founder <PER>Steve Jobs</PER>
  • 4. Named entity disambiguation - Also called named entity normalization and named entity resolution - Task: assign ambiguous entity names to canonical entities from some catalog - It is usually assumed that entities have already been recognized in the input text (i.e., it has been processed by a NER system)
  • 5. Wikification - Named entity disambiguation using Wikipedia as the catalog of entities - Also annotating concepts, not only entities
  • 6. Entity linking - Task: recognizing entity mentions in text and linking them to the corresponding entries in a knowledge base (KB) - Limited to recognizing entities for which a target entry exists in the reference KB; each KB entry is a candidate - It is assumed that the document provides sufficient context for disambiguating entities - Knowledge base (working definition): - A catalog of entities, each with one or more names (surface forms), links to other entities, and, optionally, a textual description - Wikipedia, DBpedia, Freebase, YAGO, etc.
  • 7. Overview of entity annotation tasks Task Recognition Assignment Named entity recognition entities entity type Named entity disambiguation entities entity ID / NIL Wikification entities and concepts entity ID / NIL Entity linking entities entity ID
  • 10. Anatomy of an entity linking system Mention detection Candidate selection Disambiguation entity annotations document Identification of text snippets that can potentially be linked to entities Generating a set of candidate entities for each mention Selecting a single entity (or none) for each mention, based on the context
  • 12. Mention detection - Goal: Detect all “linkable” phrases - Challenges: - Recall oriented - Do not miss any entity that should be linked - Find entity name variants - E.g. “jlo” is name variant of [Jennifer Lopez] - Filter out inappropriate ones - E.g. “new york” matches >2k different entities
  • 13. Common approach 1. Build a dictionary of entity surface forms - Entities with all names variants 2. Check all document n-grams against the dictionary - The value of n is set typically between 6 and 8 3. Filter out undesired entities - Can be done here or later in the pipeline
  • 14. Example Home to the Empire State Building, Times Square, Statue of Liberty and other iconic sites, Entities (Es)Surface form (s) … … Times_Square Times_Square_(Hong_Kong) Times_Square_(IRT_42nd_Street_Shuttle) … Times Square …… Empire State Building Empire_State_Building Empire State Empire_State_(band) Empire_State_Building Empire_State_Film_Festival Empire British_Empire Empire_(magazine) First_French_Empire Galactic_Empire_(Star_Wars) Holy_Roman_Empire Roman_Empire … New York City is a fast-paced, globally influential center of art, culture, fashion and finance.
  • 15. Surface form dictionary construction from Wikipedia - Page title - Canonical (most common) name of the entity
  • 16. Surface form dictionary construction from Wikipedia - Page title - Redirect pages - Alternative names that are frequently used to refer to an entity
  • 17. Surface form dictionary construction from Wikipedia - Page title - Redirect pages - Disambiguation pages - List of entities that share the same name
  • 18. Surface form dictionary construction from Wikipedia - Page title - Redirect pages - Disambiguation pages - Anchor texts - of links pointing to the entity's Wikipedia page
  • 19. Surface form dictionary construction from Wikipedia - Page title - Redirect pages - Disambiguation pages - Anchor texts - Bold texts from first paragraph - generally denote other name variants of the entity
  • 20. Surface form dictionary construction from other sources - Anchor texts from external web pages pointing to Wikipedia articles - Problem of synonym discovery - Expanding acronyms - Leveraging search results or query-click logs from a web search engine - ...
  • 21. Filtering mentions - Filter out mentions that are unlikely to be linked to any entity - Keyphraseness: P(keyphrase|m) = |D_link(m)| / |D(m)|, where |D_link(m)| is the number of Wikipedia articles in which m appears as a link and |D(m)| is the number of Wikipedia articles that contain m - Link probability: P(link|m) = link(m) / freq(m), where link(m) is the number of times mention m appears as a link and freq(m) is the total number of times m occurs in Wikipedia (as a link or not)
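As a rough illustration (not from the slides), these two filtering statistics could be computed from pre-collected Wikipedia counts as follows; the function names, the example counts, and the 0.02 cut-off are hypothetical.

def keyphraseness(n_articles_with_m_as_link, n_articles_containing_m):
    # P(keyphrase|m) = |D_link(m)| / |D(m)|
    return n_articles_with_m_as_link / n_articles_containing_m if n_articles_containing_m else 0.0

def link_probability(n_times_m_is_link, n_times_m_occurs):
    # P(link|m) = link(m) / freq(m)
    return n_times_m_is_link / n_times_m_occurs if n_times_m_occurs else 0.0

# Keep only mentions whose link probability exceeds a (hypothetical) threshold
mentions = {"times square": (5200, 7400), "home to": (12, 310000)}
kept = [m for m, (as_link, total) in mentions.items()
        if link_probability(as_link, total) > 0.02]
print(kept)  # ['times square']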
  • 22. Overlapping entity mentions - Dealing with them in this phase - E.g., by dropping a mention if it is subsumed by another mention - Keeping them and postponing the decision to a later stage (candidate selection or disambiguation)
  • 24. Candidate selection - Goal: Narrow down the space of disambiguation possibilities - Balances between precision and recall (effectiveness vs. efficiency) - Often approached as a ranking problem - Keeping only candidates above a score/rank threshold for downstream processing
  • 25. Commonness - Perform the ranking of candidate entities based on their overall popularity, i.e., "most common sense": P(e|m) = n(m, e) / Σ_e' n(m, e'), where n(m, e) is the number of times entity e is the link destination of mention m, and the denominator is the total number of times mention m appears as a link
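A minimal sketch of how commonness could be computed from Wikipedia link counts; the n(m, e) counts below are hypothetical.

from collections import Counter

def commonness(link_counts):
    # P(e|m) = n(m, e) / sum over e' of n(m, e')
    total = sum(link_counts.values())
    return {e: n / total for e, n in link_counts.items()}

# Hypothetical n(m, e) counts for the mention "Times Square"
counts = Counter({"Times_Square": 9400, "Times_Square_(film)": 170,
                  "Times_Square_(Hong_Kong)": 110,
                  "Times_Square_(IRT_42nd_Street_Shuttle)": 60})
print(commonness(counts))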
  • 26. Example Home to the Empire State Building, Times Square, Statue of Liberty and other iconic sites, New York City is a fast-paced, globally influential center of art, culture, fashion and finance. Candidate entities (e) for "Times Square" with their commonness P(e|m): Times_Square 0.940; Times_Square_(film) 0.017; Times_Square_(Hong_Kong) 0.011; Times_Square_(IRT_42nd_Street_Shuttle) 0.006; ...
  • 27. Commonness - Can be pre-computed and stored in the entity surface form dictionary - Follows a power law with a long tail of extremely unlikely senses; entities at the tail end of the distribution can be safely discarded - E.g., 0.001 is a sensible threshold
  • 28. Example Bulgaria's best World Cup performance was in the 1994 World Cup where they beat Germany, to reach the semi-finals, losing to Italy, and finishing in fourth ... Commonness tables for candidate entities: (1) Germany 0.9417, Germany_national_football_team 0.0139, Nazi_Germany 0.0081, German_Empire 0.0065, ...; (2) FIFA_World_Cup 0.2358, FIS_Alpine_Ski_World_Cup 0.0682, 2009_FINA_Swimming_World_Cup 0.0633, World_Cup_(men's_golf) 0.0622, ...; (3) 1998_FIFA_World_Cup 0.9556, 1998_IAAF_World_Cup 0.0296, 1998_Alpine_Skiing_World_Cup 0.0059, ...
  • 30. Disambiguation - Baseline approach: most common sense - Consider additional types of evidence - Prior importance of entities and mentions - Contextual similarity between the text surrounding the mention and the candidate entity - Coherence among all entity linking decisions in the document - Combine these signals - Using supervised learning or graph-based approaches - Optionally perform pruning - Reject low confidence or semantically meaningless annotations
  • 31. Prior importance features - Context-independent features - Neither the text nor other mentions in the document are taken into account - Keyphraseness - Link probability - Commonness
  • 32. Prior importance features - Link prior - Popularity of the entity measured in terms of incoming links: P_link(e) = link(e) / Σ_e' link(e') - Page views - Popularity of the entity measured in terms of traffic volume: P_pageviews(e) = pageviews(e) / Σ_e' pageviews(e')
  • 33. Contextual features - Compare the surrounding context of a mention with the (textual) representation of the given candidate entity - Context of a mention - Window of text (sentence, paragraph) around the mention - Entire document - Entity's representation - Wikipedia entity page, first description paragraph, terms with highest TF-IDF score, etc. - Entity's description in the knowledge base
  • 34. Contextual similarity - Commonly: bag-of-words representation - Cosine similarity: sim_cos(m, e) = (d_m · d_e) / (||d_m|| ||d_e||) - Many other options for measuring similarity - Dot product, KL divergence, Jaccard similarity - Representation does not have to be limited to bag-of-words - Concept vectors (named entities, Wikipedia categories, anchor text, keyphrases, etc.)
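A bare-bones sketch of the bag-of-words cosine similarity between a mention's context and an entity's textual representation; plain term counts stand in here for whatever term weighting (e.g., TF-IDF) a real system would use.

import math
from collections import Counter

def cosine_similarity(mention_context, entity_text):
    # sim_cos(m, e) = (d_m . d_e) / (||d_m|| * ||d_e||) over term-count vectors
    d_m = Counter(mention_context.lower().split())
    d_e = Counter(entity_text.lower().split())
    dot = sum(d_m[t] * d_e[t] for t in d_m.keys() & d_e.keys())
    norm = (math.sqrt(sum(v * v for v in d_m.values()))
            * math.sqrt(sum(v * v for v in d_e.values())))
    return dot / norm if norm else 0.0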
  • 35. Entity-relatedness features - It can reasonably be assumed that a document focuses on one or at most a few topics - Therefore, entities mentioned in a document should be topically related to each other - Capturing topical coherence by developing some measure of relatedness between (linked) entities - Defined for pairs of entities
  • 36. Wikipedia Link-based Measure (WLM) - Often referred to simply as relatedness - A close relationship is assumed between two entities if there is a large overlap between the entities linking to them: WLM(e, e') = 1 - [log(max(|L_e|, |L_e'|)) - log(|L_e ∩ L_e'|)] / [log(|E|) - log(min(|L_e|, |L_e'|))], where L_e is the set of entities that link to e and |E| is the total number of entities
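A sketch of WLM as defined above, operating on sets of incoming links; in practice the link sets and the total entity count would come from a Wikipedia link graph.

import math

def wlm(incoming_e1, incoming_e2, num_entities):
    # WLM(e, e') = 1 - [log(max |L|) - log(|overlap|)] / [log(|E|) - log(min |L|)]
    overlap = len(incoming_e1 & incoming_e2)
    if overlap == 0 or not incoming_e1 or not incoming_e2:
        return 0.0
    numerator = math.log(max(len(incoming_e1), len(incoming_e2))) - math.log(overlap)
    denominator = math.log(num_entities) - math.log(min(len(incoming_e1), len(incoming_e2)))
    return max(0.0, 1.0 - numerator / denominator) if denominator > 0 else 0.0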
  • 37. Wikipedia Link-based Measure (WLM) [Figure: obtaining a semantic relatedness measure between Automobile and Global Warming from Wikipedia's incoming and outgoing links. Image taken from Milne and Witten (2008a). An Effective, Low-Cost Measure of Semantic Relatedness Obtained from Wikipedia Links. In AAAI WikiAI Workshop.]
  • 38. Asymmetric relatedness features - A relatedness function does not have to be symmetric - E.g., the relatedness of the UNITED STATES given NEIL ARMSTRONG is intuitively larger than the relatedness of NEIL ARMSTRONG given the UNITED STATES - Conditional probability: P(e'|e) = |L_e' ∩ L_e| / |L_e|
  • 39. Entity-relatedness features - Numerous ways to define relatedness - Consider not only incoming, but also outgoing links or the union of incoming and outgoing links - Jaccard similarity, Pointwise Mutual Information (PMI), or the Chi-square statistic, etc. - Having a single relatedness function is preferred, to keep the disambiguation process simple - Various relatedness measures can effectively be combined into a single score using a machine learning approach [Ceccarelli et al., 2013]
  • 41. Disambiguation approaches - Consider local compatibility (including prior evidence) and coherence with the other entity linking decisions - Task: find an assignment Γ : M_d → E ∪ {∅} of mentions to entities (or none) - Objective function: Γ* = arg max_Γ [ Σ_{(m,e) ∈ Γ} φ(m, e) + ψ(Γ) ], where φ(m, e) is the local compatibility between the mention and the assigned entity and ψ(Γ) is a coherence function for all entity annotations in the document - This optimization problem is NP-hard! Need to resort to approximation algorithms and heuristics
  • 42. Disambiguation strategies - Individually, one-mention-at-a-time - Rank candidates for each mention, take the top ranked one (or NIL): γ(m) = arg max_{e ∈ E_m} score(m, e) - Interdependence between entity linking decisions may be incorporated in a pairwise fashion - Collectively, all mentions in the document jointly
  • 43. Disambiguation approaches - Most common sense: context = none, entity interdependence = none - Individual local disambiguation: context = text, entity interdependence = none - Individual global disambiguation: context = text & entities, entity interdependence = pairwise - Collective disambiguation: context = text & entities, entity interdependence = collective
  • 44. Individual local disambiguation - Early entity linking approaches - Local compatibility score can be written as a linear combination of features: φ(e, m) = Σ_i λ_i f_i(e, m) - Can be both context-independent and context-dependent features - Learn the "optimal" combination of features from training data using machine learning
  • 45. Individual global disambiguation - Consider what other entities are mentioned in the document - True global optimization would be NP-hard - Good approximation can be computed efficiently by considering pairwise interdependencies for each mention independently - Pairwise entity relatedness scores need to be aggregated into a single number (how coherent the given candidate entity is with the rest of the entities in the document)
  • 46. Approach #1: GLOW [Ratinov et al., 2011] - First perform local disambiguation, producing a set of linked entities E* - The coherence of entity e with all other linked entities in the document: g_j(e, E*) = ⊕_{e' ∈ E* \ {e}} f_j(e, e'), where ⊕ is an aggregator function (max or avg) and f_j is a particular relatedness function - Use the predictions in a second, global disambiguation round: score(m, e) = Σ_i λ_i f_i(e, m) + Σ_j w_j g_j(e, E*)
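A loose sketch of the two-round idea: the coherence of a candidate with the first-round entity assignments E* is added to its local score. The aggregator choice and the coherence weight are assumptions, not values from the paper.

def coherence(entity, first_round_entities, relatedness, aggregator=max):
    # g(e, E*): aggregate relatedness of e to the entities linked in round one
    scores = [relatedness(entity, e2) for e2 in first_round_entities if e2 != entity]
    return aggregator(scores) if scores else 0.0

def global_score(local_score, entity, first_round_entities, relatedness, coherence_weight=1.0):
    # score(m, e) = local compatibility + weighted coherence with E*
    return local_score + coherence_weight * coherence(entity, first_round_entities, relatedness)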
  • 47. Approach #2: TAGME [Ferragina & Scaiella, 2010] - Combine the two most important features (commonness and relatedness) using a voting scheme - The score of a candidate entity for a particular mention: score(m, e) = Σ_{m' ∈ M_d \ {m}} vote(m', e) - The vote function estimates the agreement between e and all candidate entities of all other mentions in the document
  • 48. TAGME (voting mechanism) - Average relatedness between each possible disambiguation, weighted by its commonness score: vote(m', e) = Σ_{e' ∈ E_m'} WLM(e, e') P(e'|m') / |E_m'|
  • 49. TAGME (final score) - Final decision uses a simple but robust heuristic - The top entities with the highest score are considered for a given mention and the one with the highest commonness score is selected: γ(m) = arg max_{e ∈ E_m} { P(e|m) : e ∈ top_ε[score(m, e)] }, with score(m, e) = Σ_{m' ∈ M_d \ {m}} vote(m', e)
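A simplified sketch of the voting scheme on the last three slides; candidates maps each mention to its candidate entities with commonness scores, relatedness would be WLM, and forming the top-ε set as a fixed fraction of the ranking is an assumption of this sketch.

def tagme_score(mention, entity, candidates, relatedness):
    # score(m, e) = sum over all other mentions m' of vote(m', e)
    total = 0.0
    for m2, cands in candidates.items():
        if m2 == mention or not cands:
            continue
        # vote(m', e): relatedness of e to each candidate of m',
        # weighted by commonness P(e'|m') and averaged
        total += sum(relatedness(entity, e2) * p for e2, p in cands.items()) / len(cands)
    return total

def tagme_disambiguate(mention, candidates, relatedness, epsilon=0.3):
    ranked = sorted(candidates[mention],
                    key=lambda e: tagme_score(mention, e, candidates, relatedness),
                    reverse=True)
    top = ranked[:max(1, int(len(ranked) * epsilon))]
    # among the top-scoring entities, pick the one with the highest commonness
    return max(top, key=lambda e: candidates[mention][e])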
  • 50. Collective disambiguation - Graph-based representation - Mention-entity edges capture the local compatibility between the mention and the entity - Measured using a combination of context-independent and context- dependent features - Entity-entity edges represent the semantic relatedness between a pair of entities - Common choice is relatedness (WLM) - Use these relations jointly to identify a single referent entity (or none) for each of the mentions
  • 51. Example [Figure: the Referent Graph of Example 1. Mentions (Space Jam, Bulls, Jordan) are connected to candidate entities (Space Jam, Chicago Bulls, Bull, Michael Jordan, Michael I. Jordan, Michael B. Jordan) by weighted mention-entity edges (e.g., 0.66, 0.82, 0.13, 0.01, 0.20, 0.12, 0.03, 0.08), representing Compatible relations; edges between entities represent Semantic-Related relations.]
  • 52. AIDA [Hoffart et al., 2011] - Problem formulation: find a dense subgraph that contains all mention nodes and exactly one mention-entity edge for each mention - Greedy algorithm iteratively removes edges
  • 53. Algorithm - Start with the full graph - Iteratively remove the entity node with the lowest weighted degree (along with all its incident edges), provided that each mention node remains connected to at least one entity - Weighted degree of an entity node is the sum of the weights of its incident edges - The graph with the highest density is kept as the solution - The density of the graph is measured as the minimum weighted degree among its entity nodes
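A rough sketch of the greedy procedure on the previous slide, under the assumption that the graph is given as mention-entity and entity-entity edge weights; it is not the AIDA implementation.

def greedy_dense_subgraph(mention_edges, entity_edges):
    # mention_edges: {mention: {entity: weight}}; entity_edges: {(e1, e2): weight}
    # assumes every mention has at least one candidate entity
    active = {e for cands in mention_edges.values() for e in cands}

    def weighted_degree(e):
        deg = sum(cands.get(e, 0.0) for cands in mention_edges.values())
        deg += sum(w for (a, b), w in entity_edges.items()
                   if (a == e and b in active) or (b == e and a in active))
        return deg

    best, best_density = set(active), min(weighted_degree(e) for e in active)
    while True:
        # entities whose removal still leaves every mention with >= 1 candidate
        removable = [e for e in active
                     if all(sum(1 for e2 in cands if e2 in active and e2 != e) >= 1
                            for cands in mention_edges.values() if e in cands)]
        if not removable:
            return best
        active.remove(min(removable, key=weighted_degree))
        density = min(weighted_degree(e) for e in active)
        if density > best_density:
            best, best_density = set(active), density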
  • 54. Example iteration #1 [Figure: the Referent Graph with the weighted degrees of the entity nodes (0.95, 0.86, 0.03, 1.56, 0.12, ...); the entity node with the lowest weighted degree is removed; the density of the graph is 0.03.]
  • 55. Example iteration #2 [Figure: after removing further entity nodes, the density of the graph is 0.12.]
  • 56. Example iteration #3 [Figure: after the final removals, the density of the graph is 0.86.]
  • 57. Pre- and post-processing - Pre-processing phase: remove entities that are "too distant" from the mention nodes - At the end of the iterations, the solution graph may still contain mentions that are connected to more than one entity; deal with this in post-processing - If the graph is sufficiently small, it is feasible to exhaustively consider all possible mention-entity pairs - Otherwise, a faster local (hill-climbing) search algorithm may be used
  • 58. Random walk with restarts [Han et al., 2011] - "A random walk with restart is a stochastic process to traverse a graph, resulting in a probability distribution over the vertices corresponding to the likelihood those vertices are visited" [Guo et al., 2014] - Starting vector holding the initial evidence: v(m) = tfidf(m) / Σ_{m' ∈ M_d} tfidf(m')
  • 59. Random walk process - T is the evidence propagation matrix - Evidence propagation ratios: T(m → e) = w(m → e) / Σ_{e' ∈ E_m} w(m → e') and T(e → e') = w(e → e') / Σ_{e'' ∈ E_d} w(e → e'') - Evidence propagation: r_0 = v, r_{t+1} = (1 − α) × r_t × T + α × v, where r_t is a vector holding the probability that the nodes are visited, α is the restart probability (0.1), and v is the initial evidence
  • 60. Final decision - Once the random walk process has converged to a stationary distribution, the referent entity for each mention is selected as γ(m) = arg max_{e ∈ E_m} φ(m, e) × r(e), where the local compatibility is φ(e, m) = Σ_i λ_i f_i(e, m)
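A compact sketch of the evidence-propagation iteration from the previous two slides, using numpy; the matrix T and vector v would be built from the mention-entity graph as defined above, and the convergence tolerance is an arbitrary choice.

import numpy as np

def random_walk_with_restart(T, v, alpha=0.1, tol=1e-8, max_iter=1000):
    # r_0 = v;  r_{t+1} = (1 - alpha) * r_t @ T + alpha * v
    r = v.copy()
    for _ in range(max_iter):
        r_next = (1 - alpha) * (r @ T) + alpha * v
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next
    return r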
  • 61. Pruning - Discarding meaningless or low-confidence annotations produced by the disambiguation phase - Simplest solution: use a confidence threshold - More advanced solutions - Machine learned classifier to retain only entities that are "relevant enough" (a human editor would annotate them) [Milne & Witten, 2008] - Optimization problem: decide, for each mention, whether switching the top ranked disambiguation to NIL would improve the objective function [Ratinov et al., 2011]
  • 63. Publicly available entity linking systems
    - AIDA (http://www.mpi-inf.mpg.de/yago-naga/aida/) - reference KB: YAGO2, online demo: yes, web API: yes, source code: yes (Java)
    - DBpedia Spotlight (http://spotlight.dbpedia.org/) - reference KB: DBpedia, online demo: yes, web API: yes, source code: yes (Java)
    - Illinois Wikifier (http://cogcomp.cs.illinois.edu/page/download_view/Wikifier) - reference KB: Wikipedia, online demo: no, web API: no, source code: yes (Java)
    - TAGME (https://tagme.d4science.org/tagme/) - reference KB: Wikipedia, online demo: yes, web API: yes, source code: yes (Java)
    - Wikipedia Miner (https://github.com/dnmilne/wikipediaminer) - reference KB: Wikipedia, online demo: no, web API: no, source code: yes (Java)
  • 64. Publicly available entity linking systems - AIDA [Hoffart et al., 2011] - Collective disambiguation using a graph-based approach (see under Disambiguation section) - DBpedia Spotlight [Mendes et al., 2011] - Straightforward local disambiguation approach - Vector space representations of entities are compared against the paragraphs of their mentions using the cosine similarity - Introduces Inverse Candidate Frequency (ICF) instead of using IDF - Annotates with DBpedia entities, which can be restricted to certain types or even to a custom entity set defined by a SPARQL query
  • 65. Publicly available entity linking systems - Illinois Wikifier (a.k.a. GLOW) [Ratinov et al., 2011] - Individual global disambiguation (see under Disambiguation section) - TAGME [Ferragina & Scaiella, 2010] - One of the most popular entity linking systems - Designed specifically for efficient annotation of short texts, but works well on long texts too - Individual global disambiguation (see under Disambiguation section) - Wikipedia Miner [Milne & Witten, 2008] - Seminal entity linking system that was the first to ... - combine commonness and relatedness for (local) disambiguation - have an open-source implementation and be offered as a web service
  • 67. Evaluation (end-to-end) - Comparing the system-generated annotations against a human-annotated gold standard - Evaluation criteria - Perfect match: both the linked entity and the mention offsets must match - Relaxed match: the linked entity must match; it is sufficient if the mention overlaps with the gold standard annotation
  • 68. Evaluation with relaxed match [Figure: two examples (#1 and #2) comparing ground truth mention spans with system annotation spans.]
  • 69. Evaluation metrics - Set-based metrics: - Precision: fraction of the annotations produced by the system that are correctly linked - Recall: fraction of the gold-standard annotations that the system correctly linked - F-measure: harmonic mean of precision and recall - Metrics are computed over a collection of documents - Micro-averaged: aggregated across mentions - Macro-averaged: aggregated across documents
  • 70. Evaluation metrics - Micro-averaged: P_mic = |A_D ∩ Â_D| / |A_D|, R_mic = |A_D ∩ Â_D| / |Â_D|, where A_D are the annotations generated by the entity linking system and Â_D are the ground truth annotations - Macro-averaged: P_mac = Σ_{d ∈ D} (|A_d ∩ Â_d| / |A_d|) / |D|, R_mac = Σ_{d ∈ D} (|A_d ∩ Â_d| / |Â_d|) / |D| - F1 score: F1 = 2 × P × R / (P + R)
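A small sketch of how these averages could be computed; annotations are represented as sets of (mention, entity) pairs per document, which is one possible encoding rather than a prescribed one.

def prf(system, gold):
    # set-based precision, recall and F1 for a single document
    tp = len(system & gold)
    p = tp / len(system) if system else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def micro_macro(per_doc):
    # per_doc: list of (system_annotations, gold_annotations) set pairs, one per document
    tp = sum(len(s & g) for s, g in per_doc)
    n_sys = sum(len(s) for s, _ in per_doc)
    n_gold = sum(len(g) for _, g in per_doc)
    p_mic = tp / n_sys if n_sys else 0.0
    r_mic = tp / n_gold if n_gold else 0.0
    p_mac = sum(prf(s, g)[0] for s, g in per_doc) / len(per_doc)
    r_mac = sum(prf(s, g)[1] for s, g in per_doc) / len(per_doc)
    return (p_mic, r_mic), (p_mac, r_mac)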
  • 71. Test collections by individual researchers
    - MSNBC [Cucerzan, 2007] - reference KB: Wikipedia, documents: news (20), annotations: entities (755 mentions), all mentions annotated: yes, NIL: yes
    - AQUAINT [Milne & Witten, 2008] - reference KB: Wikipedia, documents: news (50), annotations: entities & concepts (727 mentions), all mentions annotated: no, NIL: no
    - IITB [Kulkarni et al., 2009] - reference KB: Wikipedia, documents: web pages (107), annotations: entities (17,200 mentions), all mentions annotated: yes, NIL: yes
    - ACE2004 [Ratinov et al., 2011] - reference KB: Wikipedia, documents: news (57), annotations: entities (306 mentions), all mentions annotated: yes, NIL: yes
    - CoNLL-YAGO [Hoffart et al., 2011] - reference KB: YAGO2, documents: news (1,393), annotations: entities (34,956 mentions), all mentions annotated: yes, NIL: yes
  • 72. Evaluation campaigns - INEX Link-the-Wiki - TAC Entity Linking - Entity Recognition and Disambiguation (ERD) Challenge
  • 73. Component-based evaluation - The pipeline architecture makes the evaluation of entity linking systems especially challenging - The main focus is on the disambiguation component, but its performance is largely influenced by the preceding steps - Fair comparison between two approaches can only be made if they share all other elements of the pipeline
  • 74. Meta evaluations - [Hachey et al., 2009] - Reimplements three state-of-the-art systems - DEXTER [Ceccarelli et al., 2013] - Reimplements TAGME, Wikipedia Miner, and collective EL [Han et al., 2011] - BAT-Framework [Cornolti et al., 2013] - Compares publicly available entity annotation systems (AIDA, Illinois Wikifier, TAGME, Wikipedia Miner, DBpedia Spotlight) - A number of test collections (news, web pages, and tweets) - GERBIL [Usbeck et al., 2015] - Extends the BAT-Framework by being able to link to any knowledge base - Benchmarks any annotation service through a REST interface - Provides persistent URLs for experimental settings (for reproducibility)
  • 76. A Cross-Lingual Dictionary for English Wikipedia Concepts - Collecting strings (mentions) that link to Wikipedia articles on the Web - https://research.googleblog.com/2012/05/from-words-to-concepts-and-back.html - Each dictionary entry records the string (mention), the Wikipedia article it links to, and the number of times the mention is linked to that article
  • 77. Freebase Annotations of the ClueWeb Corpora - ClueWeb annotated with Freebase entities (by Google) - http://lemurproject.org/clueweb09/FACC1/ - http://lemurproject.org/clueweb12/FACC1/ - Each annotation records: the name of the document that was annotated, the entity mention, the beginning and end byte offsets, the entity ID in Freebase, the confidence given both the mention and the context, and the confidence given just the context (ignoring the mention)
  • 78. Entity linking in queries
  • 79. Entity linking in queries - Challenges - search queries are short - limited context - lack of proper grammar, spelling - multiple interpretations - needs to be fast
  • 84. ERD’14 challenge - Task: finding query interpretations - Input: keyword query - Output: sets of sets of entities - Reference KB: Freebase - Annotations are to be performed by a web service within a given time limit
  • 85. Evaluation - P = |I ∩ Î| / |I|, R = |I ∩ Î| / |Î|, F = 2 × P × R / (P + R), where Î is the ground truth and I is the system annotation - Example query: new york pizza manhattan, with interpretations such as {New York-style pizza, Manhattan}, {New York City, Manhattan}, and {New York-style pizza}
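A sketch of the interpretation-set scoring, representing each interpretation as a frozenset of entity IDs; the concrete interpretation sets in the example are hypothetical.

def erd_scores(system_interpretations, gold_interpretations):
    # P = |I ∩ Î| / |I|,  R = |I ∩ Î| / |Î|,  F = 2PR / (P + R)
    I, I_hat = set(system_interpretations), set(gold_interpretations)
    tp = len(I & I_hat)
    p = tp / len(I) if I else 0.0
    r = tp / len(I_hat) if I_hat else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Hypothetical interpretations for the query "new york pizza manhattan"
gold = {frozenset({"New_York-style_pizza", "Manhattan"}),
        frozenset({"New_York_City", "Manhattan"})}
system = {frozenset({"New_York-style_pizza", "Manhattan"})}
print(erd_scores(system, gold))  # (1.0, 0.5, ~0.667)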
  • 86. ERD’14 results (a single interpretation is returned) - Rank 1: SMAPH Team, F1 0.7076, latency 0.49 - Rank 2: NTUNLP, F1 0.6797, latency 1.04 - Rank 3: Seznam Research, F1 0.6693, latency 3.91 - http://web-ngram.research.microsoft.com/erd2014/LeaderBoard.aspx
  • 88.
  • 89. Questions? Contact | @krisztianbalog | krisztianbalog.com