1. Eugene Agichtein
Emory University
Atlanta, USA
Carlos Castillo
Debora Donato
Aris Gionis
Yahoo! Research
Barcelona, Spain
Gilad Mishne
Yahoo! S&A Sciences
Santa Clara, CA, USA
2.
3. E. Agichtein, C. Castillo, D. Donato, A. Gionis, G. Mishne: Finding High-Quality Content in Social Media. WSDM'08.
User-generated content ≠ Traditional publishing
4.
Chris Anderson: “The Long Tail”. Hyperion, 2006.
[Figure: plot with axes Frequency and Quality; regions labeled Traditional publishing and User-generated]
5.
Chris Anderson: “The Long Tail”. Hyperion, 2006.
[Figure: plot with axes Quantity and Quality; regions labeled User-generated and Traditional publishing]
6.
<!--
7.
[Figure: axes Quantity and Quality, marked “?”]
8.
Chris Martin from Coldplay in The Rolling Stone, Fortieth Anniversary, July 2007.
[Figure: axes Quantity and Quality]
“We think it's all about quality over quantity now, because there's so much noise everywhere, there's no point in putting anything out unless it's fucking amazing.”
9.
[Figure: axes Quantity and Quality; regions labeled A, User-generated, and Traditional publishing]
10.
[Figure: axes Quantity and Quality; regions labeled F.A., User-generated, and Traditional publishing]
11.
-->
12.
[Figure: axes Quantity and Quality, marked “?”]
13.
[Figure: axes Quantity and Quality]
(Hard) problem
14.
15.
16.
Best answer: picked by votes -or- picked by asker
All answers: + “Thumbs up”, + “Thumbs down”
Question: + “Stars”
17.
¼ of questions ask for an opinion: informal polls
¾ of questions seek information or advice
18.
19.
Q. Su, D. Pavlov, J.-H. Chow, W. C. Baker. “Internet-scale collection of human-reviewed data”. WWW'07.
17%-45% of answers were correct
65%-90% of questions had at least one correct answer
20.
There are top contributors ...
... but they don't have all the answers
21.
[Figure: axes Quantity and Quality]
Task: find high-quality items
22.
Existing tools
● Link-based ranking methods
● Propagation of trust/distrust
● Automatic text analysis
● Usage mining
● ...
23.
Sources of information
● Content analysis
● Usage data (clicks)
● Community ratings
24.
Sources of information
● Content analysis (with errors)
● Clicks (with noise)
● Community ratings (sparse, with spam)
25.
Text analysis Clicks Community
26.
Text analysis Clicks Community
27.
Language modeling
Text analysis
Readability statistics
28.
Text analysis
Language modeling
Readability statistics
Punctuation density
Capitalization errors
Number of words
+ spacing density, syllables per word, ...
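The readability statistics listed on this slide can be sketched in a few lines. This is only an illustration: the function name `surface_features`, the vowel-group syllable count, and the "sentence starts in lowercase" test for capitalization errors are assumptions, not the definitions used in the paper.

```python
import re

def surface_features(text: str) -> dict:
    """Crude surface statistics for one question or answer (illustrative only)."""
    words = re.findall(r"[A-Za-z']+", text)
    n_chars = max(len(text), 1)   # guard against division by zero on empty text
    n_words = max(len(words), 1)
    punctuation = sum(1 for ch in text if ch in ".,;:!?")
    spaces = sum(1 for ch in text if ch.isspace())
    # Assumed heuristic: a sentence that starts with a lowercase letter is an error.
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    cap_errors = sum(1 for s in sentences if s[0].islower())
    # Assumed heuristic: one syllable per group of consecutive vowels, at least one per word.
    syllables = sum(max(len(re.findall(r"[aeiouy]+", w.lower())), 1) for w in words)
    return {
        "num_words": len(words),
        "punctuation_density": punctuation / n_chars,
        "spacing_density": spaces / n_chars,
        "capitalization_errors": cap_errors,
        "syllables_per_word": syllables / n_words,
    }
```

Calling `surface_features("hello world. this is bad")` returns one dictionary per text, which can be appended to the other feature groups as a flat vector.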
29.
Text analysis
Language modeling
Readability statistics
G. Mishne, D. Carmel, R. Lempel: “Blocking blog spam with language model disagreement”. AIRWeb'05.
Language model disagreement
Distributions of word n-grams and part-of-speech sequences
when|how|why -- “to” -- verb
“how to identify ...”
when|how|why -- verb -- verb -- pronoun -- verb
“how do I remove ...”
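One way to turn "language model disagreement" into a number is to compare the n-gram distribution of an item against a reference distribution. The sketch below uses KL divergence with add-alpha smoothing; that choice, along with the names `ngram_counts` and `kl_divergence`, is an assumption made for illustration and not necessarily the setup used in the talk.

```python
import math
from collections import Counter

def ngram_counts(tokens, n=1):
    """Count word n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def kl_divergence(p_counts, q_counts, alpha=1.0):
    """KL(P || Q) over the joint vocabulary, with add-alpha smoothing (assumed)."""
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + alpha * len(vocab)
    q_total = sum(q_counts.values()) + alpha * len(vocab)
    kl = 0.0
    for gram in vocab:
        p = (p_counts[gram] + alpha) / p_total   # Counter returns 0 for unseen n-grams
        q = (q_counts[gram] + alpha) / q_total
        kl += p * math.log(p / q)
    return kl
```

A larger value means the item's wording diverges more from the reference text. The same function works on part-of-speech tag sequences if the tokens are tags instead of words.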
30.
Text analysis Clicks Community
31.
Clicks
If we know that a question is clicked 100 times,
and another question is clicked 10,000 times ...
... we still know nothing
32.
Clicks
Per-category average
Clicks
Per-category stdev.
Question age
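Raw click counts only become informative once they are compared with other questions in the same category. One plausible reading of this slide is a per-category z-score, sketched below; the function name `normalized_clicks`, the input dictionaries, and the z-score form itself are assumptions. Question age is left out of the sketch.

```python
from statistics import mean, pstdev

def normalized_clicks(clicks_by_question, category_of):
    """Express each question's clicks as standard deviations from its category mean."""
    by_category = {}
    for question, clicks in clicks_by_question.items():
        by_category.setdefault(category_of[question], []).append(clicks)
    stats = {cat: (mean(vals), pstdev(vals)) for cat, vals in by_category.items()}
    scores = {}
    for question, clicks in clicks_by_question.items():
        mu, sigma = stats[category_of[question]]
        # A category with zero spread carries no signal; fall back to 0.0.
        scores[question] = (clicks - mu) / sigma if sigma > 0 else 0.0
    return scores
```

With this normalization, a question with 100 clicks in a quiet category and one with 10,000 clicks in a busy category can end up with the same score.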
33.
Text analysis Clicks Community
34.
35.
36.
37.
Power laws
38.
39.
P. Jurczyk, E. Agichtein: “Discovering authorities in Q.A. communities by using link analysis”. CIKM'07.
Askers
Answerers
40.
Community
answers
votes +
votes -
picks as best
41.
Community
Degree-based metrics
# answers given
# answers received
# votes + given
# votes + received
etc...
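Degree-based metrics are plain counts over the interaction graphs. A minimal sketch, assuming each graph is available as a list of `(source_user, target_user)` pairs; the edge-list format and the name `degree_metrics` are hypothetical.

```python
from collections import Counter

def degree_metrics(edges):
    """Return (given, received) counts for a list of (source_user, target_user) edges.

    For the "answers" graph an edge could be (answerer, asker); for the votes
    graph, (voter, author of the voted answer). Both conventions are assumed.
    """
    given, received = Counter(), Counter()
    for source, target in edges:
        given[source] += 1
        received[target] += 1
    return given, received
```

Running it once per graph (answers, positive votes, negative votes, best-answer picks) yields the "# given" and "# received" features for every user.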
42.
Community
Propagation-based metrics
1. Pagerank score
2. HITS hub score
3. HITS authority score
Computed on each graph
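The three propagation-based scores can be computed with simple power iteration. The sketch below is a textbook version, not the authors' code: the damping factor 0.85, the fixed iteration count, the uniform redistribution of rank from users without out-edges, and the edge direction (asker to answerer) are all assumptions.

```python
def pagerank(nodes, edges, damping=0.85, iterations=50):
    """PageRank by power iteration; rank of dangling nodes is spread uniformly."""
    out_links = {v: [] for v in nodes}
    for source, target in edges:
        out_links[source].append(target)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for target in out_links[v]:
                new_rank[target] += damping * rank[v] / len(out_links[v])
        rank = new_rank
    return rank

def hits(nodes, edges, iterations=50):
    """HITS hub and authority scores, L2-normalized after every update."""
    hub = {v: 1.0 for v in nodes}
    auth = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        auth = {v: 0.0 for v in nodes}
        for source, target in edges:
            auth[target] += hub[source]       # good hubs point to good authorities
        norm = sum(x * x for x in auth.values()) ** 0.5 or 1.0
        auth = {v: x / norm for v, x in auth.items()}
        hub = {v: 0.0 for v in nodes}
        for source, target in edges:
            hub[source] += auth[target]       # good authorities are pointed to by good hubs
        norm = sum(x * x for x in hub.values()) ** 0.5 or 1.0
        hub = {v: x / norm for v, x in hub.items()}
    return hub, auth
```

With edges running from asker to answerer, a high authority score marks a user who answers questions from many active askers, and a high hub score marks a user whose questions attract strong answerers.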
43.
Text analysis Clicks Community
Relations
Learning
Training labels
44.
Text analysis Clicks Community
Relations
Learning
Training labels
45.
Answer quality (rows) by question quality (columns)
          High   Medium   Low
High             15%
Medium           76%
Low              9%
Total            100%
46.
Answer quality (rows) by question quality (columns)
          High   Medium   Low
High             15%      8%
Medium           76%      74%
Low              9%       18%
Total            100%     100%
47.
Answer quality (rows) by question quality (columns)
          High   Medium   Low
High      41%    15%      8%
Medium    53%    76%      74%
Low       6%     9%       18%
Total     100%   100%     100%
48.
Answer quality (rows) by question quality (columns)
          High   Medium   Low
High      41%    15%      8%
Medium    53%    76%      74%
Low       6%     9%       18%
Total     100%   100%     100%
Question quality and answer quality are not independent
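If the two qualities were independent, every question-quality level would show the same distribution of answer quality. A short check on the slide's percentages makes the gap explicit. Reading each 100% group as "answer quality given question quality" is an assumption about the slide layout, and `spread_by_answer_level` is a hypothetical helper.

```python
# Outer key: question quality; inner dict: share of answers at each quality level.
answer_given_question = {
    "High":   {"High": 0.41, "Medium": 0.53, "Low": 0.06},
    "Medium": {"High": 0.15, "Medium": 0.76, "Low": 0.09},
    "Low":    {"High": 0.08, "Medium": 0.74, "Low": 0.18},
}

def spread_by_answer_level(table):
    """Max minus min share of each answer level across question levels.

    Under independence every spread would be close to zero.
    """
    levels = list(table)
    return {
        answer: max(table[q][answer] for q in levels) - min(table[q][answer] for q in levels)
        for answer in levels
    }

spread = spread_by_answer_level(answer_given_question)
```

The share of high-quality answers ranges from 8% to 41% depending on question quality, so the spread for that level is 33 percentage points.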
49.
Relations: questions
[Figure: graph linking users (U), questions (Q), answers (A), and votes (V) to the question being evaluated]
Labels in the figure:
● Question being evaluated
● User asking question
● Answers to the question being evaluated
● Answerers of question being evaluated
● Questions asked
● Answers given
● Votes given
● Answers to questions asked
50.
Relations: answers
[Figure: graph linking users (U), questions (Q), answers (A), and votes (V) to the answer being evaluated]
Labels in the figure:
● Answer being evaluated
● Answerer
● Question being answered
● Asker of question being answered
● Other answers to the same question
● Questions asked
● Answers given
● Votes given
● Answers to questions asked
51.
Text analysis Clicks Community
Relations
Learning
Training labels
52.
Text analysis Clicks Community
Relations
Learning: stochastic gradient boosted trees
J. H. Friedman: “Stochastic gradient boosting”. Comp. Stat. Data. Anal., 38(4), 367-378, 2002.
Labeled data: 6K questions, 8K answers
Evaluation: Precision, Recall (F1); Area under ROC curve
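The evaluation measures named on this slide can be written out directly. This is a generic sketch for binary labels (1 = high quality), not the evaluation code behind the reported numbers; the function names are hypothetical, and AUC is computed through its pairwise-ranking interpretation.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def roc_auc(y_true, scores):
    """Probability that a random positive item outscores a random negative one."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    # Ties count as half a win, which matches the area under the ROC curve.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))
```

Precision and recall depend on the decision threshold applied to the classifier's score, while AUC summarizes ranking quality across all thresholds, which is why both are worth reporting.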
53.
Task: high-quality questions
                     Precision   Recall   AUC
N-grams (N)          65%         48%      0.52
N + text analysis    76%         65%      0.65
N + clicks           68%         57%      0.58
N + relations        74%         65%      0.66
All                  79%         77%      0.76
54.
Task: high-quality answers
                     Precision   Recall   AUC
N-grams (N)          67%         86%      0.81
N + text analysis    71%         93%      0.88
N + clicks           -           -        -
N + relations        69%         85%      0.82
All                  73%         91%      0.87
55.
In the paper ...
● Framework for quality estimation in social media
● Graph-based model of contributor relationships
● Details on the relative importance of (sets of) features
56.
What did we learn?
● Human assessments for this task
– ... have relatively low agreement
● Classifying questions/answers
– ... is substantially different from document classification
● Look at orthogonal feature spaces
57.
Text analysis Clicks Community
Relations
Learning
Future work: relational learning
58.
Thank you!
59.
60.