Entity linking involves recognizing entity mentions in text and linking them to entries in a knowledge base. It includes three main steps: mention detection to identify linkable phrases, candidate selection to generate potential entities for each mention, and disambiguation to select the best entity for each mention based on context. Disambiguation approaches consider local compatibility of mentions and entities as well as global coherence across all entity linking decisions. Collective approaches aim to jointly optimize all entity annotations for a document.
Entity Linking at the 10th Russian Summer School in Information Retrieval (RuSSIR 2016)
1. Entity Linking
10th Russian Summer School in Information Retrieval (RuSSIR 2016) | Saratov, Russia, 2016
Krisztian Balog
University of Stavanger
@krisztianbalog
3. Named entity recognition (NER)
- Also known as entity identification, entity extraction, and
entity chunking
- Task: identifying named entities in text and labeling them with
one of the possible entity types
- Person (PER), organization (ORG), location (LOC), miscellaneous
(MISC). Sometimes also temporal expressions (TIMEX) and certain
types of numerical expressions (NUMEX)
<LOC>Silicon Valley</LOC> venture capitalist <PER>Michael
Moritz</PER> said that today's billion-dollar "unicorn" startups
can learn from <ORG>Apple</ORG> founder <PER>Steve Jobs</PER>
4. Named entity disambiguation
- Also called named entity normalization and named entity
resolution
- Task: assign ambiguous entity names to canonical entities
from some catalog
- It is usually assumed that entities have already been recognized in the
input text (i.e., it has been processed by a NER system)
5. Wikification
- Named entity disambiguation using Wikipedia as the catalog
of entities
- Also annotating concepts, not only entities
6. Entity linking
- Task: recognizing entity mentions in text and linking them to
the corresponding entries in a knowledge base (KB)
- Limited to recognizing entities for which a target entry exists in the
reference KB; each KB entry is a candidate
- It is assumed that the document provides sufficient context for
disambiguating entities
- Knowledge base (working definition):
- A catalog of entities, each with one or more names (surface forms),
links to other entities, and, optionally, a textual description
- Wikipedia, DBpedia, Freebase, YAGO, etc.
7. Overview of entity annotation tasks
Task                          Recognition            Assignment
Named entity recognition      entities               entity type
Named entity disambiguation   entities               entity ID / NIL
Wikification                  entities and concepts  entity ID / NIL
Entity linking                entities               entity ID
10. Anatomy of an entity linking system
document → Mention detection → Candidate selection → Disambiguation → entity annotations
- Mention detection: identification of text snippets that can potentially be linked to entities
- Candidate selection: generating a set of candidate entities for each mention
- Disambiguation: selecting a single entity (or none) for each mention, based on the context
12. Mention detection
- Goal: Detect all “linkable” phrases
- Challenges:
- Recall oriented
- Do not miss any entity that should be linked
- Find entity name variants
- E.g. “jlo” is name variant of [Jennifer Lopez]
- Filter out inappropriate ones
- E.g. “new york” matches >2k different entities
13. Common approach
1. Build a dictionary of entity surface forms
- Entities with all names variants
2. Check all document n-grams against the dictionary
- The value of n is typically set between 6 and 8
3. Filter out undesired entities
- Can be done here or later in the pipeline
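The dictionary lookup in steps 1-2 can be sketched as follows. The surface form dictionary here is a hypothetical toy fragment; a real one, built as described on the following slides, would hold millions of entries:

```python
# Toy surface form dictionary: keys are lowercased name variants,
# values are candidate entities they may refer to (made-up fragment).
SURFACE_FORMS = {
    "empire state building": ["Empire_State_Building"],
    "times square": ["Times_Square", "Times_Square_(Hong_Kong)"],
    "new york city": ["New_York_City"],
}

def detect_mentions(text, max_n=8):
    """Check all word n-grams (n = 1..max_n) against the dictionary."""
    tokens = text.lower().split()
    mentions = []
    for n in range(min(max_n, len(tokens)), 0, -1):  # longest n-grams first
        for i in range(len(tokens) - n + 1):
            ngram = " ".join(tokens[i:i + n])
            if ngram in SURFACE_FORMS:
                mentions.append((ngram, SURFACE_FORMS[ngram]))
    return mentions

mentions = detect_mentions("Home to the Empire State Building and Times Square")
```

Note that overlapping matches are all kept here; filtering can happen at this stage or later in the pipeline.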
14. Example
Home to the Empire State Building, Times Square, Statue of Liberty and other iconic sites,
New York City is a fast-paced, globally influential center of art, culture, fashion and finance.
Surface form (s)       Entities (E_s)
Empire State Building  Empire_State_Building
Empire State           Empire_State_(band), Empire_State_Building, Empire_State_Film_Festival
Empire                 British_Empire, Empire_(magazine), First_French_Empire, Galactic_Empire_(Star_Wars), Holy_Roman_Empire, Roman_Empire, …
Times Square           Times_Square, Times_Square_(Hong_Kong), Times_Square_(IRT_42nd_Street_Shuttle), …
…                      …
15–19. Surface form dictionary construction from Wikipedia
- Page title
  - Canonical (most common) name of the entity
- Redirect pages
  - Alternative names that are frequently used to refer to an entity
- Disambiguation pages
  - List of entities that share the same name
- Anchor texts
  - Of links pointing to the entity's Wikipedia page
- Bold texts from first paragraph
  - Generally denote other name variants of the entity
20. Surface form dictionary construction
from other sources
- Anchor texts from external web pages pointing to Wikipedia
articles
- Problem of synonym discovery
- Expanding acronyms
- Leveraging search results or query-click logs from a web search engine
- ...
21. Filtering mentions
- Filter out mentions that are unlikely to be linked to any entity
- Keyphraseness:
  P(keyphrase|m) = |D_link(m)| / |D(m)|
  where |D_link(m)| is the number of Wikipedia articles where m appears as a link,
  and |D(m)| is the number of Wikipedia articles that contain m
- Link probability:
  P(link|m) = link(m) / freq(m)
  where link(m) is the number of times mention m appears as a link,
  and freq(m) is the total number of times mention m occurs in Wikipedia (as a link or not)
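Both statistics are simple ratios over precomputed corpus counts; a minimal sketch, with invented counts for illustration:

```python
def keyphraseness(n_docs_with_m_as_link, n_docs_containing_m):
    """P(keyphrase|m): Wikipedia articles where m appears as a link,
    over Wikipedia articles that contain m at all."""
    return n_docs_with_m_as_link / n_docs_containing_m if n_docs_containing_m else 0.0

def link_probability(n_times_m_linked, n_times_m_occurs):
    """P(link|m): occurrences of m as a link, over all occurrences of m."""
    return n_times_m_linked / n_times_m_occurs if n_times_m_occurs else 0.0

# A distinctive phrase is linked in most articles that mention it;
# a common word such as "home" almost never is (made-up counts).
p_times_square = link_probability(9200, 11000)
p_home = link_probability(15, 250000)
```

Mentions whose link probability falls below a small threshold would be filtered out.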
22. Overlapping entity mentions
- Dealing with them in this phase
- E.g., by dropping a mention if it is subsumed by another mention
- Keeping them and postponing the decision to a later stage
(candidate selection or disambiguation)
24. Candidate selection
- Goal: Narrow down the space of disambiguation possibilities
- Balances between precision and recall (effectiveness vs.
efficiency)
- Often approached as a ranking problem
- Keeping only candidates above a score/rank threshold for downstream
processing
25. Commonness
- Perform the ranking of candidate entities based on their overall popularity, i.e., "most common sense"
  P(e|m) = n(m, e) / Σ_e' n(m, e')
  where n(m, e) is the number of times entity e is the link destination of mention m,
  and the denominator is the total number of times mention m appears as a link
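Commonness can be computed directly from anchor-text statistics; the counts below are hypothetical, chosen only to resemble the distribution in the example that follows:

```python
from collections import Counter

# Hypothetical counts of how often each entity is the link target
# of the mention "Times Square" in Wikipedia anchor text.
n = Counter({
    "Times_Square": 940,
    "Times_Square_(film)": 17,
    "Times_Square_(Hong_Kong)": 11,
    "Times_Square_(IRT_42nd_Street_Shuttle)": 6,
})

def commonness(counts):
    """P(e|m) = n(m, e) / sum over e' of n(m, e')."""
    total = sum(counts.values())
    return {e: c / total for e, c in counts.items()}

probs = commonness(n)
```

Since the counts are fixed per mention, these probabilities can be precomputed and stored in the surface form dictionary.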
26. Example
Home to the Empire State Building, Times Square, Statue of Liberty and other iconic sites,
New York City is a fast-paced, globally influential center of art, culture, fashion and finance.
Entity (e)                               Commonness (P(e|m))
Times_Square                             0.940
Times_Square_(film)                      0.017
Times_Square_(Hong_Kong)                 0.011
Times_Square_(IRT_42nd_Street_Shuttle)   0.006
…                                        …
27. Commonness
- Can be pre-computed and stored in the entity surface form
dictionary
- Follows a power law with a long tail of extremely unlikely
senses; entities at the tail end of the distribution can be safely
discarded
- E.g., 0.001 is a sensible threshold
28. Example
Bulgaria's best World Cup performance was in the 1994 World Cup where they beat Germany, to reach the semi-finals, losing to Italy, and finishing in fourth ...

Entity                           Commonness
Germany                          0.9417
Germany_national_football_team   0.0139
Nazi_Germany                     0.0081
German_Empire                    0.0065
...

Entity                           Commonness
FIFA_World_Cup                   0.2358
FIS_Alpine_Ski_World_Cup         0.0682
2009_FINA_Swimming_World_Cup     0.0633
World_Cup_(men's_golf)           0.0622
...

Entity                           Commonness
1998_FIFA_World_Cup              0.9556
1998_IAAF_World_Cup              0.0296
1998_Alpine_Skiing_World_Cup     0.0059
...
30. Disambiguation
- Baseline approach: most common sense
- Consider additional types of evidence
- Prior importance of entities and mentions
- Contextual similarity between the text surrounding the mention and
the candidate entity
- Coherence among all entity linking decisions in the document
- Combine these signals
- Using supervised learning or graph-based approaches
- Optionally perform pruning
- Reject low confidence or semantically meaningless annotations
31. Prior importance features
- Context-independent features
- Neither the text nor other mentions in the document are taken into
account
- Keyphraseness
- Link probability
- Commonness
32. Prior importance features
- Link prior
  - Popularity of the entity measured in terms of incoming links
  P_link(e) = link(e) / Σ_e' link(e')
- Page views
  - Popularity of the entity measured in terms of traffic volume
  P_pageviews(e) = pageviews(e) / Σ_e' pageviews(e')
33. Contextual features
- Compare the surrounding context of a mention with the
(textual) representation of the given candidate entity
- Context of a mention
- Window of text (sentence, paragraph) around the mention
- Entire document
- Entity's representation
- Wikipedia entity page, first description paragraph, terms with highest
TF-IDF score, etc.
- Entity's description in the knowledge base
34. Contextual similarity
- Commonly: bag-of-words representation
- Cosine similarity
  sim_cos(m, e) = (d_m · d_e) / (‖d_m‖ ‖d_e‖)
- Many other options for measuring similarity
  - Dot product, KL divergence, Jaccard similarity
- Representation does not have to be limited to bag-of-words
  - Concept vectors (named entities, Wikipedia categories, anchor text, keyphrases, etc.)
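A minimal bag-of-words cosine similarity sketch, with a toy mention context and entity description:

```python
import math
from collections import Counter

def cosine(d1, d2):
    """Cosine similarity between two bag-of-words term-frequency vectors."""
    dot = sum(d1[t] * d2[t] for t in set(d1) & set(d2))
    norm1 = math.sqrt(sum(v * v for v in d1.values()))
    norm2 = math.sqrt(sum(v * v for v in d2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

# Context of the mention vs. the candidate entity's textual representation
# (toy strings; real systems would use TF-IDF weights, not raw counts).
mention_ctx = Counter("home to the empire state building in new york".split())
entity_repr = Counter("the empire state building is a skyscraper in new york city".split())
sim = cosine(mention_ctx, entity_repr)
```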
35. Entity-relatedness features
- It can reasonably be assumed that a document focuses on
one or at most a few topics
- Therefore, entities mentioned in a document should be
topically related to each other
- Capturing topical coherence by developing some measure
of relatedness between (linked) entities
- Defined for pairs of entities
36. Wikipedia Link-based Measure (WLM)
- Often referred to simply as relatedness
- A close relationship is assumed between two entities if there is a large overlap between the entities linking to them
  WLM(e, e') = 1 − [log(max(|L_e|, |L_e'|)) − log(|L_e ∩ L_e'|)] / [log(|E|) − log(min(|L_e|, |L_e'|))]
  where L_e is the set of entities that link to e, and |E| is the total number of entities
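The formula can be computed from sets of incoming links; a sketch with arbitrary toy link sets:

```python
import math

def wlm(l_e1, l_e2, total_entities):
    """Wikipedia Link-based Measure over incoming-link sets."""
    overlap = len(l_e1 & l_e2)
    if overlap == 0:
        return 0.0  # no shared in-links: treat as unrelated
    num = math.log(max(len(l_e1), len(l_e2))) - math.log(overlap)
    den = math.log(total_entities) - math.log(min(len(l_e1), len(l_e2)))
    return 1.0 - num / den

# Toy incoming-link sets (entity IDs are just integers here).
links_a = set(range(100))        # 100 incoming links
links_b = set(range(50, 160))    # 110 incoming links, 50 shared with links_a
score = wlm(links_a, links_b, total_entities=10_000)
```

Identical link sets give a score of 1; disjoint link sets give 0.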
37. Wikipedia Link-based Measure (WLM)
[Figure: obtaining a semantic relatedness measure between Automobile and Global Warming from their incoming and outgoing Wikipedia links.]
Image taken from Milne and Witten (2008a). An Effective, Low-Cost Measure of Semantic Relatedness Obtained from Wikipedia Links. In AAAI WikiAI Workshop.
38. Asymmetric relatedness features
- A relatedness function does not have to be symmetric
  - E.g., the relatedness of the UNITED STATES given NEIL ARMSTRONG is intuitively larger than the relatedness of NEIL ARMSTRONG given the UNITED STATES
- Conditional probability
  P(e'|e) = |L_e' ∩ L_e| / |L_e|
39. Entity-relatedness features
- Numerous ways to define relatedness
- Consider not only incoming, but also outgoing links or the union of
incoming and outgoing links
- Jaccard similarity, Pointwise Mutual Information (PMI), or the Chi-square statistic, etc.
- Having a single relatedness function is preferred, to keep the
disambiguation process simple
- Various relatedness measures can effectively be combined
into a single score using a machine learning approach
[Ceccarelli et al., 2013]
41. Disambiguation approaches
- Consider local compatibility (including prior evidence) and coherence with the other entity linking decisions
- Task: find an assignment Γ : M_d → E ∪ {∅}
- Objective function:
  Γ* = arg max_Γ [ Σ_{(m,e)∈Γ} φ(m, e) + ψ(Γ) ]
  where φ(m, e) is the local compatibility between the mention and the assigned entity,
  and ψ(Γ) is the coherence function for all entity annotations in the document
- This optimization problem is NP-hard! Need to resort to approximation algorithms and heuristics
42. Disambiguation strategies
- Individually, one-mention-at-a-time
  - Rank candidates for each mention, take the top ranked one (or NIL):
    γ(m) = arg max_{e∈E_m} score(m, e)
  - Interdependence between entity linking decisions may be incorporated in a pairwise fashion
- Collectively, all mentions in the document jointly
43. Disambiguation approaches
Approach                           Context          Entity interdependence
Most common sense                  none             none
Individual local disambiguation    text             none
Individual global disambiguation   text & entities  pairwise
Collective disambiguation          text & entities  collective
44. Individual local disambiguation
- Early entity linking approaches
- Local compatibility score can be written as a linear combination of features:
  φ(e, m) = Σ_i λ_i f_i(e, m)
  (can be both context-independent and context-dependent features)
- Learn the "optimal" combination of features from training data using machine learning
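The linear combination can be sketched as follows; the feature names, values, and weights are made up for illustration (real weights would be learned from training data):

```python
def local_compatibility(features, weights):
    """phi(e, m) = sum_i lambda_i * f_i(e, m)."""
    return sum(weights[name] * value for name, value in features.items())

# Made-up feature values for one (entity, mention) pair, and made-up weights.
weights  = {"commonness": 0.6, "context_sim": 0.3, "link_prior": 0.1}
features = {"commonness": 0.94, "context_sim": 0.41, "link_prior": 0.22}
phi = local_compatibility(features, weights)
```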
45. Individual global disambiguation
- Consider what other entities are mentioned in the document
- True global optimization would be NP-hard
- Good approximation can be computed efficiently by
considering pairwise interdependencies for each mention
independently
- Pairwise entity relatedness scores need to be aggregated into a single
number (how coherent the given candidate entity is with the rest of the
entities in the document)
46. Approach #1: GLOW
[Ratinov et al., 2011]
- First perform local disambiguation, producing a set of linked entities E*
- The coherence of entity e with all other linked entities in the document:
  g_j(e, E*) = Ψ[ Σ_{e'∈E*∖{e}} f_j(e, e') ]
  where Ψ is an aggregator function (max or avg) and f_j is a particular relatedness function
- Use the predictions in a second, global disambiguation round:
  score(m, e) = Σ_i λ_i f_i(e, m) + Σ_j ω_j g_j(e, E*)
47. Approach #2: TAGME
[Ferragina & Scaiella, 2010]
- Combine the two most important features (commonness and relatedness) using a voting scheme
- The score of a candidate entity for a particular mention:
  score(m, e) = Σ_{m'∈M_d∖{m}} vote(m', e)
- The vote function estimates the agreement between e and all candidate entities of all other mentions in the document
48. TAGME (voting mechanism)
- Average relatedness between each possible disambiguation, weighted by its commonness score:
  vote(m', e) = Σ_{e'∈E_m'} WLM(e, e') P(e'|m') / |E_m'|
[Figure: candidate entity e of mention m receives votes from the candidate entities e' of another mention m', each vote weighted by WLM(e, e') and P(e'|m').]
49. TAGME (final score)
- Final decision uses a simple but robust heuristic
- The top entities with the highest score are considered for a given mention, and the one with the highest commonness score is selected:
  γ(m) = arg max_{e∈E_m} {P(e|m) : e ∈ top_ε[score(m, e)]}
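Putting the two TAGME formulas together as a sketch; the relatedness function and the candidate sets with their commonness values are toy stand-ins, not real Wikipedia statistics:

```python
def vote(cands_m_prime, e, relatedness):
    """vote(m', e): average relatedness of e to each candidate e' of m',
    weighted by the candidate's commonness P(e'|m')."""
    if not cands_m_prime:
        return 0.0
    return sum(relatedness(e, e2) * p for e2, p in cands_m_prime.items()) / len(cands_m_prime)

def tagme_score(m, e, doc_mentions, relatedness):
    """score(m, e): sum of votes from all other mentions in the document."""
    return sum(vote(cands, e, relatedness)
               for m2, cands in doc_mentions.items() if m2 != m)

# Toy relatedness: one strongly related pair, everything else weakly related.
rel = lambda a, b: 0.7 if {a, b} == {"Times_Square", "New_York_City"} else 0.1

doc_mentions = {
    "times square": {"Times_Square": 0.94, "Times_Square_(film)": 0.02},
    "new york": {"New_York_City": 0.90},
}
score = tagme_score("times square", "Times_Square", doc_mentions, rel)
```

Here Times_Square outvotes the film, because it is more related to the other mention's dominant candidate.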
50. Collective disambiguation
- Graph-based representation
- Mention-entity edges capture the local compatibility
between the mention and the entity
- Measured using a combination of context-independent and context-
dependent features
- Entity-entity edges represent the semantic relatedness
between a pair of entities
- Common choice is relatedness (WLM)
- Use these relations jointly to identify a single referent entity
(or none) for each of the mentions
51. Example
[Figure: the Referent Graph of Example 1, taken from Han et al. (2011). Each edge between a name mention and an entity represents a Compatible relation; each edge between two entities represents a Semantic-Related relation. The mentions "Space Jam", "Bulls", and "Jordan" are connected by weighted edges (0.66, 0.82, 0.13, 0.01, 0.20, 0.12, 0.03, 0.08) to the candidate entities Space Jam, Chicago Bulls, Bull, Michael Jordan, Michael I. Jordan, and Michael B. Jordan.]
mentions
52. AIDA [Hoffart et al., 2011]
- Problem formulation: find a dense subgraph that contains all
mention nodes and exactly one mention-entity edge for each
mention
- Greedy algorithm iteratively removes edges
53. Algorithm
- Start with the full graph
- Iteratively remove the entity node with the lowest weighted
degree (along with all its incident edges), provided that each
mention node remains connected to at least one entity
- Weighted degree of an entity node is the sum of the weights of its
incident edges
- The graph with the highest density is kept as the solution
- The density of the graph is measured as the minimum weighted degree
among its entity nodes
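The greedy procedure above can be sketched on a toy graph; the mentions, candidates, and edge weights below are invented, and the pre- and post-processing steps described on a later slide are omitted:

```python
def weighted_degree(e, me_edges, ee_edges, active):
    """Sum of the weights of the entity node's incident edges."""
    d = sum(w for (m, e2), w in me_edges.items() if e2 == e)
    d += sum(w for (a, b), w in ee_edges.items()
             if (a == e and b in active) or (b == e and a in active))
    return d

def greedy_dense_subgraph(mentions, entities, me_edges, ee_edges):
    active = set(entities)
    best, best_density = set(active), 0.0
    while True:
        # an entity is removable only if every mention it serves
        # keeps at least one other candidate in the graph
        removable = [e for e in active
                     if all(sum(1 for e2 in active if (m, e2) in me_edges) > 1
                            for m in mentions if (m, e) in me_edges)]
        if not removable:
            break
        victim = min(removable,
                     key=lambda e: weighted_degree(e, me_edges, ee_edges, active))
        active.remove(victim)
        # density = minimum weighted degree among remaining entity nodes
        density = min(weighted_degree(e, me_edges, ee_edges, active) for e in active)
        if density > best_density:
            best, best_density = set(active), density
    return best, best_density

mentions = ["Jordan", "Bulls"]
entities = ["Michael_Jordan", "Michael_I._Jordan", "Chicago_Bulls"]
me_edges = {("Jordan", "Michael_Jordan"): 0.8,
            ("Jordan", "Michael_I._Jordan"): 0.1,
            ("Bulls", "Chicago_Bulls"): 0.7}
ee_edges = {("Michael_Jordan", "Chicago_Bulls"): 0.9}
solution, density = greedy_dense_subgraph(mentions, entities, me_edges, ee_edges)
```

The weakly connected Michael_I._Jordan is pruned, leaving a coherent basketball interpretation.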
54. Example iteration #1
[Figure: the Referent Graph from the previous slide, with the weighted degree annotated next to each entity node (0.95, 0.86, 0.03, 1.56, 0.12, …); the entity with the lowest weighted degree is crossed out.]
Which entity should be removed? The one with the lowest weighted degree.
What is the density of the graph? 0.03
55. Example iteration #2
[Figure: the same graph after the first removal; again the entity with the lowest weighted degree, along with its incident edges, is crossed out.]
Which entity should be removed? The one with the lowest weighted degree.
What is the density of the graph? 0.12
56. Example iteration #3
[Figure: the same graph after the second removal; a third entity node is crossed out, leaving one candidate per mention.]
Which entity should be removed? The one with the lowest weighted degree.
What is the density of the graph? 0.86
57. Pre- and post-processing
- Pre-processing phase: remove entities that are "too distant"
from the mention nodes
- At the end of the iterations, the solution graph may still
contain mentions that are connected to more than one entity;
deal with this in post-processing
- If the graph is sufficiently small, it is feasible to exhaustively consider all
possible mention-entity pairs
- Otherwise, a faster local (hill-climbing) search algorithm may be used
58. Random walk with restarts [Han et al., 2011]
- "A random walk with restart is a stochastic process to traverse a graph, resulting in a probability distribution over the vertices corresponding to the likelihood those vertices are visited" [Guo et al., 2014]
- Starting vector holding initial evidence:
  v(m) = tfidf(m) / Σ_{m'∈M_d} tfidf(m')
59. Random walk process
- T is the evidence propagation matrix
- Evidence propagation ratio:
  T(m → e) = w(m → e) / Σ_{e'∈E_m} w(m → e')
  T(e → e') = w(e → e') / Σ_{e''∈E_d} w(e → e'')
- Evidence propagation:
  r^0 = v
  r^{t+1} = (1 − α) × r^t × T + α × v
  where r is a vector holding the probability that the nodes are visited, v is the initial evidence, and α is the restart probability (0.1)
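The propagation can be sketched with plain power iteration on a tiny toy graph (three nodes, a row-stochastic T, and all initial evidence on node 0):

```python
def random_walk_with_restart(T, v, alpha=0.1, iters=200):
    """Iterate r_{t+1} = (1 - alpha) * r_t T + alpha * v."""
    r = v[:]
    n = len(v)
    for _ in range(iters):
        r = [(1 - alpha) * sum(r[i] * T[i][j] for i in range(n)) + alpha * v[j]
             for j in range(n)]
    return r

# Toy row-stochastic propagation matrix over 3 nodes:
# node 0 splits its evidence between nodes 1 and 2, which send it back.
T = [[0.0, 0.5, 0.5],
     [1.0, 0.0, 0.0],
     [1.0, 0.0, 0.0]]
v = [1.0, 0.0, 0.0]   # initial evidence concentrated on node 0
r = random_walk_with_restart(T, v)
```

Because T is row-stochastic and v sums to one, r remains a probability distribution at every step.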
60. Final decision
- Once the random walk process has converged to a stationary distribution, the referent entity for each mention is selected as:
  γ(m) = arg max_{e∈E_m} φ(m, e) × r(e)
  where φ(e, m) = Σ_i λ_i f_i(e, m) is the local compatibility
61. Pruning
- Discarding meaningless or low-confidence annotations
produced by the disambiguation phase
- Simplest solution: use a confidence threshold
- More advanced solutions
- Machine learned classifier to retain only entities that are "relevant enough" (a human editor would annotate them) [Milne & Witten, 2008]
- Optimization problem: decide, for each mention, whether switching the
top ranked disambiguation to NIL would improve the objective function
[Ratinov et al., 2011]
63. Publicly available entity linking systems
System             Reference KB  Online demo  Web API  Source code
AIDA               YAGO2         Yes          Yes      Yes (Java)
  http://www.mpi-inf.mpg.de/yago-naga/aida/
DBpedia Spotlight  DBpedia       Yes          Yes      Yes (Java)
  http://spotlight.dbpedia.org/
Illinois Wikifier  Wikipedia     No           No       Yes (Java)
  http://cogcomp.cs.illinois.edu/page/download_view/Wikifier
TAGME              Wikipedia     Yes          Yes      Yes (Java)
  https://tagme.d4science.org/tagme/
Wikipedia Miner    Wikipedia     No           No       Yes (Java)
  https://github.com/dnmilne/wikipediaminer
64. Publicly available entity linking systems
- AIDA [Hoffart et al., 2011]
- Collective disambiguation using a graph-based approach (see under
Disambiguation section)
- DBpedia Spotlight [Mendes et al., 2011]
- Straightforward local disambiguation approach
- Vector space representations of entities are compared against the paragraphs of
their mentions using the cosine similarity;
- Introduce Inverse Candidate Frequency (ICF) instead of using IDF
- Annotate with DBpedia entities, which can be restricted to certain
types or even to a custom entity set defined by a SPARQL query
65. Publicly available entity linking systems
- Illinois Wikifier (a.k.a. GLOW) [Ratinov et al., 2011]
- Individual global disambiguation (see under Disambiguation section)
- TAGME [Ferragina & Scaiella, 2010]
- One of the most popular entity linking systems
- Designed specifically for efficient annotation of short texts, but works
well on long texts too
- Individual global disambiguation (see under Disambiguation section)
- Wikipedia Miner [Milne & Witten, 2008]
- Seminal entity linking system that was first to ...
- combine commonness and relatedness for (local) disambiguation
- have open-source implementation and be offered as a web service
67. Evaluation (end-to-end)
- Comparing the system-generated annotations against a
human-annotated gold standard
- Evaluation criteria
- Perfect match: both the linked entity and the mention offsets must
match
- Relaxed match: the linked entity must match; it is sufficient if the mention overlaps with the gold standard
68. Evaluation with relaxed match
[Figure: two examples contrasting ground truth and system annotations whose mention boundaries differ but overlap.]
69. Evaluation metrics
- Set-based metrics:
  - Precision: fraction of the system's annotations that are correctly linked
  - Recall: fraction of the ground truth annotations that the system found
  - F-measure: harmonic mean of precision and recall
- Metrics are computed over a collection of documents
  - Micro-averaged: aggregated across mentions
  - Macro-averaged: aggregated across documents
70. Evaluation metrics
- Micro-averaged:
  P_mic = |A_D ∩ Â_D| / |A_D|    R_mic = |A_D ∩ Â_D| / |Â_D|
  where A_D denotes the annotations generatedated by the entity linking system and Â_D the ground truth annotations
- Macro-averaged:
  P_mac = (Σ_{d∈D} |A_d ∩ Â_d| / |A_d|) / |D|    R_mac = (Σ_{d∈D} |A_d ∩ Â_d| / |Â_d|) / |D|
- F1 score:
  F1 = 2 × P × R / (P + R)
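A sketch of the four averages over toy per-document annotation sets (A: system output, Â: ground truth; the sets below are invented):

```python
def micro_macro_prf(system_docs, gold_docs):
    """Micro/macro precision and recall over per-document annotation sets.
    Assumes every document has at least one system and one gold annotation."""
    tp = sum(len(s & g) for s, g in zip(system_docs, gold_docs))
    p_mic = tp / sum(len(s) for s in system_docs)      # aggregated over mentions
    r_mic = tp / sum(len(g) for g in gold_docs)
    n = len(system_docs)
    p_mac = sum(len(s & g) / len(s) for s, g in zip(system_docs, gold_docs)) / n
    r_mac = sum(len(s & g) / len(g) for s, g in zip(system_docs, gold_docs)) / n
    f1 = 2 * p_mic * r_mic / (p_mic + r_mic) if p_mic + r_mic else 0.0
    return p_mic, r_mic, p_mac, r_mac, f1

system = [{"A", "B"}, {"C"}]        # annotations produced by the system
gold   = [{"A"},      {"C", "D"}]   # ground truth annotations
p_mic, r_mic, p_mac, r_mac, f1 = micro_macro_prf(system, gold)
```

Micro averaging weights documents by their number of annotations; macro averaging weights every document equally, which is why the two can differ.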
71. Test collections by individual researchers
Collection                         Reference KB  Documents  #Docs  Annotations          #Mentions  All?  NIL
MSNBC [Cucerzan, 2007]             Wikipedia     News       20     entities             755        Yes   Yes
AQUAINT [Milne & Witten, 2008]     Wikipedia     News       50     entities & concepts  727        No    No
IITB [Kulkarni et al., 2009]       Wikipedia     Web pages  107    entities             17,200     Yes   Yes
ACE2004 [Ratinov et al., 2011]     Wikipedia     News       57     entities             306        Yes   Yes
CoNLL-YAGO [Hoffart et al., 2011]  YAGO2         News       1,393  entities             34,956     Yes   Yes
73. Component-based evaluation
- The pipeline architecture makes the evaluation of entity
linking systems especially challenging
- The main focus is on the disambiguation component, but its
performance is largely influenced by the preceding steps
- Fair comparison between two approaches can only be made
if they share all other elements of the pipeline
74. Meta evaluations
- [Hachey et al., 2009]
  - Reimplements three state-of-the-art systems
- DEXTER [Ceccarelli et al., 2013]
  - Reimplements TAGME, Wikipedia Miner, and collective EL [Han et al., 2011]
- BAT-Framework [Cornolti et al., 2013]
  - Compares publicly available entity annotation systems (AIDA, Illinois Wikifier, TAGME, Wikipedia Miner, DBpedia Spotlight)
  - A number of test collections (news, web pages, and tweets)
- GERBIL [Usbeck et al., 2015]
  - Extends the BAT-Framework by being able to link to any knowledge base
  - Benchmarks any annotation service through a REST interface
  - Provides persistent URLs for experimental settings (for reproducibility)
76. A Cross-Lingual Dictionary for English Wikipedia Concepts
- Collecting strings (mentions) that link to Wikipedia articles on the Web
- https://research.googleblog.com/2012/05/from-words-to-concepts-and-back.html
- Each dictionary entry maps a string (mention) to a Wikipedia article, together with the number of times the mention is linked to that article
77. Freebase Annotations of the ClueWeb Corpora
- ClueWeb annotated with Freebase entities (by Google)
- http://lemurproject.org/clueweb09/FACC1/
- http://lemurproject.org/clueweb12/FACC1/
- Each annotation contains the name of the document that was annotated, the entity mention, its beginning and end byte offsets, the entity ID in Freebase, a confidence given both the mention and the context, and a confidence given just the context (ignoring the mention)
79. Entity linking in queries
- Challenges
- search queries are short
- limited context
- lack of proper grammar, spelling
- multiple interpretations
- needs to be fast
84. ERD’14 challenge
- Task: finding query interpretations
- Input: keyword query
- Output: sets of sets of entities
- Reference KB: Freebase
- Annotations are to be performed by a web service within a
given time limit
85. Evaluation
P = |I ∩ Î| / |I|    R = |I ∩ Î| / |Î|    F = 2 · P · R / (P + R)
where Î is the set of ground truth interpretations and I the set of system-annotated interpretations
Example query: new york pizza manhattan
Ground truth (Î): {New York-style pizza, Manhattan}, {New York City, Manhattan}
System annotation (I): {New York City, Manhattan}, {New York-style pizza}