1. How to Measure Quality with Disagreement?
or the Three Sides of CrowdTruth
Lora Aroyo & Chris Welty
2. CrowdTruth
Annotator disagreement is signal, not noise.
It is indicative of the variation in human semantic interpretation of signs.
It can indicate ambiguity, vagueness, similarity, over-generality, etc., as well as quality.
3. CrowdTruth Dependencies
worker metrics for detecting spam
→ quality of sentences
→ quality of the target semantics
Worker quality metrics can improve significantly when the quality of these other aspects of semantic interpretation is considered.
7. Disagreement for Sentence Clarity
Sentence: Feeling the way the CHEST expands (PALPATION), can identify areas of the lung that are full of fluid.
Question: Is CHEST related to PALPATION?
Relation labels visible in the figure: diagnose, location, associated with, is_a, other, part_of
Crowd annotation counts per relation: 0 0 0 2 3 0 0 0 1 0 0 4 4 1
The unclear relationship between the two arguments is reflected in the disagreement.
8. Disagreement for Sentence Clarity
Sentence: Redness (HYPERAEMIA), irritation (chemosis) and watering (epiphora) of the eyes are symptoms common to all forms of CONJUNCTIVITIS.
Question: Is HYPERAEMIA related to CONJUNCTIVITIS?
Relation labels visible in the figure: symptom, cause
Crowd annotation counts per relation: 0 0 0 1 0 0 0 0 13 0 0 0 0 0
The clearly expressed relation between the two arguments is reflected in the agreement.
9. Sentence-Relation Score
Measures how clearly a sentence expresses a relation
Sentence vector: 0 1 1 0 0 4 3 0 0 5 1 0
Unit vector for relation R6
Cosine = .55
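The cosine on this slide can be checked with a few lines of Python. This is a minimal sketch, assuming that the twelve numbers on the slide are the sentence vector (counts for relations R1 to R12) and that the unit vector for R6 has a 1 in the sixth position and 0 elsewhere. The function name `cosine` is illustrative and not taken from the slides.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Assumption: the twelve counts on the slide form the sentence vector (R1..R12).
sentence_vector = [0, 1, 1, 0, 0, 4, 3, 0, 0, 5, 1, 0]

# Unit vector for relation R6: 1 in the sixth position, 0 elsewhere.
r6 = [1 if i == 5 else 0 for i in range(12)]

score = cosine(sentence_vector, r6)
print(round(score, 2))  # 0.55, i.e. 4 / sqrt(53)
```

Under these assumptions the result agrees with the value shown on the slide.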
11. Worker Metrics
how much A WORKER disagrees with THE CROWD per sentence → the average, over all sentences the worker annotated, of the cosine distance between the worker's sentence vector and the full sentence vector (minus that worker)
are there consistently like-minded workers → pairwise metric, averaged for a particular worker → there may be communities of thought that consistently disagree with others but agree among themselves
Low-quality workers generally have high scores in both.
avg relations per sentence → per worker, the number of relations chosen per sentence, averaged over all sentences the worker annotates. A high score here can help indicate low-quality workers.
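Two of these worker metrics can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the data layout (`annotations[sentence][worker]` holding a 0/1 vector over relations), the function names, and the toy data are all hypothetical, and cosine distance is taken to be 1 minus the cosine. The pairwise worker-worker metric is left out because the slide does not spell out how it is computed.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def worker_sentence_disagreement(annotations, worker):
    """Average cosine distance between a worker's vector and the crowd's
    vector with that worker removed, over the sentences they annotated."""
    distances = []
    for by_worker in annotations.values():
        if worker not in by_worker:
            continue
        others = [vec for w, vec in by_worker.items() if w != worker]
        if not others:
            continue  # nobody to compare against on this sentence
        rest = [sum(col) for col in zip(*others)]  # sentence vector minus worker
        distances.append(1.0 - cosine(by_worker[worker], rest))
    return sum(distances) / len(distances) if distances else 0.0

def avg_relations_per_sentence(annotations, worker):
    """Mean number of relations the worker selected per annotated sentence."""
    counts = [sum(bw[worker]) for bw in annotations.values() if worker in bw]
    return sum(counts) / len(counts) if counts else 0.0

# Hypothetical toy data: w3 selects every relation on every sentence.
annotations = {
    "s1": {"w1": [1, 0, 0], "w2": [1, 0, 0], "w3": [1, 1, 1]},
    "s2": {"w1": [0, 1, 0], "w2": [0, 1, 0], "w3": [1, 1, 1]},
}

for w in ("w1", "w3"):
    print(w,
          round(worker_sentence_disagreement(annotations, w), 3),
          avg_relations_per_sentence(annotations, w))
```

In this toy data, w3 scores higher than w1 on both metrics, which is the pattern the slide associates with low-quality workers.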
12. Sentence Metrics
Sentence-relation score → core CrowdTruth metric for relation extraction → measured for each relation on each sentence as the cosine of the unit vector for the relation with the sentence vector, indicating whether the relation is clearly or vaguely expressed
Sentence clarity → defined for each sentence as the max relation score for that sentence, indicating whether the sentence is clear, or ambiguous or confusing
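Both sentence metrics follow directly from the sentence vector. The sketch below relies on the fact that the cosine between a vector and the unit vector for one relation is that relation's count divided by the vector's length. The function names and the two example vectors are made up for illustration.

```python
import math

def sentence_relation_scores(sentence_vector):
    """Cosine of the sentence vector with each relation's unit vector."""
    norm = math.sqrt(sum(c * c for c in sentence_vector))
    if norm == 0:
        return [0.0] * len(sentence_vector)
    return [c / norm for c in sentence_vector]

def sentence_clarity(sentence_vector):
    """Max sentence-relation score for the sentence."""
    return max(sentence_relation_scores(sentence_vector))

# Hypothetical vectors: one where the crowd agrees, one where it is split.
agreed = [0, 0, 13, 1, 0]
split = [2, 3, 1, 4, 4]

print(round(sentence_clarity(agreed), 3))
print(round(sentence_clarity(split), 3))
```

A sentence on which nearly all workers pick the same relation gets a clarity close to 1, while votes spread over several relations pull the clarity down.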
13. Relation Metrics
Relation similarity → the causal power (pairwise conditional probability). A high similarity score indicates that the relations are confusable to workers.
Relation ambiguity → defined for each relation as the max relation similarity for the relation. If a relation is clear, it will have a low score.
Relation clarity → defined for each relation as the max sentence-relation score for the relation over all sentences. A high clarity score means that it is at least possible to express the relation clearly.
Relation frequency → the number of sentences in which the relation is annotated at least once
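The four relation metrics can be sketched over a list of sentence vectors. One assumption matters here: relation similarity is computed as a plain conditional probability counted over sentences (how often relation j appears given that relation i appears), and the "causal power" formulation named on the slide may differ from this. The function names and example vectors are hypothetical.

```python
import math

def relation_similarity(vectors, i, j):
    """Assumed form: P(relation j annotated | relation i annotated),
    counted over sentences."""
    with_i = [v for v in vectors if v[i] > 0]
    if not with_i:
        return 0.0
    return sum(1 for v in with_i if v[j] > 0) / len(with_i)

def relation_ambiguity(vectors, i):
    """Max similarity of relation i to any other relation."""
    n = len(vectors[0])
    return max((relation_similarity(vectors, i, j)
                for j in range(n) if j != i), default=0.0)

def relation_clarity(vectors, i):
    """Max sentence-relation score for relation i over all sentences."""
    best = 0.0
    for v in vectors:
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 0:
            best = max(best, v[i] / norm)
    return best

def relation_frequency(vectors, i):
    """Number of sentences in which relation i is annotated at least once."""
    return sum(1 for v in vectors if v[i] > 0)

# Hypothetical sentence vectors over three relations.
vectors = [
    [10, 0, 0],
    [6, 5, 0],
    [0, 7, 1],
    [0, 0, 9],
]

for r in range(3):
    print(r,
          relation_ambiguity(vectors, r),
          round(relation_clarity(vectors, r), 3),
          relation_frequency(vectors, r))
```

Note that this similarity is not symmetric in general: the probability of j given i need not equal the probability of i given j.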
16. Impact of Sentence Quality on Worker Quality
(a) the space with no filtering of sentences or relations: a single line cannot separate the spammers from non-spammers
(b) the space after sentence filtering, (c) after relation filtering, and (d) after both sentence and relation filtering. Sentence filtering makes the classes linearly separable, and the separation between the classes increases in the subsequent figures.
17. Impact of Relation Quality on Worker Quality
(a) the space with no filtering of sentences or relations: a single line cannot separate the spammers from non-spammers
(c) the space after relation filtering
Relation filtering defines the space much more clearly, with a large separation between positive and negative instances.
The pairwise improvements to the worker scores are significant with p < .001, which is better than the sentence clarity improvements.
18. Combining Sentence & Relation Filtering
• first, low-clarity sentences were filtered out
• then, vague and ambiguous relations were filtered out
• worker metrics were computed on these new sentences and vectors
• this separates the space even further, and the pairwise improvement in worker scores from the baseline (unfiltered) is significant with p < .0005
• the improvement over sentence filtering alone is also significant (p < .01)
• the improvement over relation filtering alone is significant only at p < .05
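The two filtering steps above can be sketched as a single pass over the worker annotations. This is only an illustration: the clarity threshold, the set of dropped relations, the data layout (`annotations[sentence][worker]` holding a 0/1 vector over relations), and all names are assumptions, since the slides do not state the cut-offs that were used.

```python
import math

def sentence_vector(by_worker):
    """Sum the workers' vectors for one sentence."""
    return [sum(col) for col in zip(*by_worker.values())]

def clarity(vector):
    """Max cosine with any relation's unit vector (counts are non-negative)."""
    norm = math.sqrt(sum(c * c for c in vector))
    return max(vector) / norm if norm else 0.0

def filter_annotations(annotations, min_clarity, dropped_relations):
    filtered = {}
    for sid, by_worker in annotations.items():
        # Step 1: drop sentences whose clarity is below the assumed threshold.
        if clarity(sentence_vector(by_worker)) < min_clarity:
            continue
        # Step 2: remove the columns of relations judged vague or ambiguous.
        filtered[sid] = {
            w: [c for i, c in enumerate(vec) if i not in dropped_relations]
            for w, vec in by_worker.items()
        }
    return filtered

# Hypothetical toy data over three relations.
annotations = {
    "clear": {"w1": [1, 0, 0], "w2": [1, 0, 0], "w3": [1, 0, 1]},
    "unclear": {"w1": [1, 0, 0], "w2": [0, 1, 0], "w3": [0, 0, 1]},
}

kept = filter_annotations(annotations, min_clarity=0.8, dropped_relations={2})
print(kept)
```

Worker metrics would then be recomputed on `kept`, which holds only the remaining sentences and the shortened vectors.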
19. Quality measures in semantic interpretation tasks are inter-dependent
Higher accuracy can be achieved by considering the impact of sentence quality & relation quality on worker quality measurements.
Worker quality metrics improve significantly with respect to known spammers when the quality of the individual sentences & target relations is incorporated.
Other relationships between the corners of the triangle of reference, e.g.
→ the impact of relation & worker quality on sentence measures,
→ the impact of worker & sentence quality on relation measures