Crowdsourcing for Information Retrieval: From Statistics to Ethics
1. Crowdsourcing for Information Retrieval:
From Statistics to Ethics
Matt Lease
School of Information
University of Texas at Austin
@mattlease
ml@utexas.edu
2. Roadmap
• Scalability Challenges in IR Evaluation (brief)
• Benchmarking Statistical Consensus Methods
• Task Routing via Matrix Factorization
• Toward Ethical Crowdsourcing
Matt Lease <ml@utexas.edu>
2
4. Why Evaluation at Scale?
• Evaluation should closely
mirror real use conditions
• The best algorithm at
small scale may not be
best at larger scales
– Banko and Brill (2001)
– Halevy et al. (2009)
• IR systems should be evaluated on the scale of
data which users will search in practice
5. Why is Evaluation at Scale Hard?
• Multiple ways to evaluate; consider Cranfield
– Given a document collection and set of user queries
– Label documents for relevance to each query
– Evaluate search algorithms on these queries & documents
• Labeling data is slow/expensive/difficult
• Approach 1: label less data (e.g. active learning)
– Pooling, metrics robust to sparse data (e.g., BPref)
– Measure only relative performance (e.g., statAP, MTC)
• Approach 2: label data more efficiently
– Crowdsourcing (e.g., Amazon’s Mechanical Turk)
7. Crowdsourcing for IR Evaluation
• Origin: Alonso et al. (SIGIR Forum 2008)
– Continuing active area of research
• Primary concern: ensuring reliable data
– Reliable data provides the foundation for evaluation
– If QA is inefficient, its overhead could erase any savings
– Common strategy: ask multiple people to judge
relevance, then aggregate their answers (consensus)
8. Roadmap
• Scalability Challenges in Evaluating IR Systems
• Benchmarking Statistical Consensus Methods
• Task Routing via Matrix Factorization
• Toward Ethical Crowdsourcing
9. SQUARE: A Benchmark for Research
on Computing Crowd Consensus
Aashish Sheshadri and M. Lease, HCOMP’13
ir.ischool.utexas.edu/square (open source)
10. Background
• How do we resolve disagreement among multiple
people’s answers to arrive at consensus?
• Simple baseline: majority voting
• Long history pre-dating crowdsourcing
– Dawid and Skene’79, Smyth et al., ’95
– Recent focus on quality assurance with crowds
• Many more methods, active research topic
– Across many areas: ML, Vision, NLP, IR, DB, …
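To make the simple baseline concrete, here is an illustrative sketch of majority voting; the `majority_vote` helper and its tie-breaking behavior are our own, not from any cited paper:

```python
from collections import Counter

def majority_vote(labels_by_item):
    """Pick the most frequent label per item. Ties break arbitrarily
    by first-seen order (a simplification worth noting)."""
    return {item: Counter(labels).most_common(1)[0][0]
            for item, labels in labels_by_item.items()}

# Three workers judge the relevance of two documents:
votes = {"d1": [1, 1, 0], "d2": [0, 0, 1]}
print(majority_vote(votes))  # {'d1': 1, 'd2': 0}
```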
11. Why Benchmark?
• Drive field innovation by clear challenge tasks
– e.g., David Tse’s FIST 2012 Keynote (Comp. Biology)
• Many other things we can learn
– How do methods compare?
• Qualitatively & quantitatively?
– What is the state-of-the-art today?
– What works, what doesn’t, and why?
• Where is further research most needed?
– How has field progressed over time?
12. Consensus Methods Compared: Pros & Cons
• Method = Model + Training + Inference; ordered simplest to most complex
• MV (Majority Vote)
– Pros: simple, fast, no training; task-independent
– Cons: most limited model; cannot be supervised; no confusion matrix
• ZC (Demartini’12): worker reliability parameters
– Pros: task-independent; can be supervised; allows priors on worker reliability & class distribution
– Cons: no confusion matrix; classification only
• GLAD (Whitehill et al.’09): worker reliability & task difficulty parameters
– Pros: task-independent; can be supervised; prior on class distribution
– Cons: no confusion matrix; no worker priors; classification only
• Naïve Bayes (NB, Snow et al.’08): = D&S model, fully supervised
– Pros: supports multi-class tasks; models worker confusion; simple maximum likelihood
– Cons: no worker priors; classification only; space proportional to number of classes
• Dawid & Skene’79 (DS): class priors & worker confusion matrices
– Pros: supports multi-class tasks; models worker confusion; unsupervised, semi-supervised, or fully supervised
– Cons: no worker priors; classification only; space proportional to number of classes
• Raykar et al.’10 (RY): worker confusion, sensitivity, specificity; (optional) automatic classifier
– Pros: classifier not required; priors on worker confusion and class distribution; multi-class support; can be supervised
– Cons: automatic classifier requires feature representation; classification only; space proportional to number of classes
• Welinder et al.’10 (CUBAM): worker reliability and confusion, annotation noise, task difficulty
– Pros: detailed model of the annotation process; can identify worker clusters; multi-class support
– Cons: complex, with many hyper-parameters; unclear how to supervise
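Since confusion-matrix models like DS come up repeatedly here, a minimal unsupervised EM sketch in the spirit of Dawid & Skene (1979) may help; the smoothing constant, initialization, and variable names below are our own illustrative choices, not the original paper's:

```python
import numpy as np

def dawid_skene(labels, n_classes, n_iter=50):
    """Minimal unsupervised Dawid & Skene-style EM.
    labels: list of (item, worker, label) triples; labels are ints 0..K-1."""
    items = sorted({i for i, _, _ in labels})
    workers = sorted({w for _, w, _ in labels})
    I, W, K = len(items), len(workers), n_classes
    item_idx = {v: n for n, v in enumerate(items)}
    worker_idx = {v: n for n, v in enumerate(workers)}

    # Initialize posterior over true labels from empirical vote fractions.
    T = np.zeros((I, K))
    for i, w, l in labels:
        T[item_idx[i], l] += 1
    T /= T.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: class priors and per-worker confusion matrices,
        # conf[w, true_class, observed_label], with small smoothing.
        prior = T.sum(axis=0) / I
        conf = np.full((W, K, K), 1e-6)
        for i, w, l in labels:
            conf[worker_idx[w], :, l] += T[item_idx[i]]
        conf /= conf.sum(axis=2, keepdims=True)

        # E-step: posterior over each item's true label.
        logT = np.tile(np.log(prior), (I, 1))
        for i, w, l in labels:
            logT[item_idx[i]] += np.log(conf[worker_idx[w], :, l])
        T = np.exp(logT - logT.max(axis=1, keepdims=True))
        T /= T.sum(axis=1, keepdims=True)

    return {items[n]: int(T[n].argmax()) for n in range(I)}
```

Unlike majority voting, this model can recover the truth even when a consistent minority of workers is adversarial, since their confusion matrices are learned.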
16. Findings
• Majority voting never best, rarely much worse
• Each method is often best for some condition
– e.g., the original dataset it was designed for
• DS & RY tend to perform best (RY adds priors)
• No method performs far beyond others
– Of course, contributions aren’t just empirical…
17. Why Don’t We See Bigger Gains?
• Gold is too noisy to detect improvement?
– Cormack & Kolcz’09, Klebanov & Beigman’10
• Limited tasks / scenarios considered?
– e.g., we exclude hybrid methods & worker filtering
• Might we see greater differences from
– Better benchmark tests?
– Better tuning of methods?
– Additional methods?
• We invite community contributions!
18. Roadmap
• Scalability Challenges in Evaluating IR Systems
• Benchmarking Statistical Consensus Methods
• Task Routing via Matrix Factorization
• Toward Ethical Crowdsourcing
19. Crowdsourced Task Routing via
Matrix Factorization
HyunJoon Jung and M. Lease
arXiv 1310.5142, under review
21. Task Routing: Background
• Selection vs. recommendation vs. assignment
– Potential to improve work quality & satisfaction
– Task search time adds latency & is uncompensated
– Tradeoffs in push vs. pull, varying models
• Many matching criteria one could consider
– Preferences, Experience, Skills, Job constraints, …
• References
– Law and von Ahn, 2011 (Ch. 4)
– Chilton et al., 2010
• MTurk “free” selection constrained by search interface
22. Matrix Factorization Approach
• Collaborative filtering-based recommendation
• Intuition: a worker achieves similar accuracy on similar tasks
– The notion is more general: e.g., preference, expertise, etc.
• Approach, step by step:
1. Accumulate repeated crowdsourced data: a worker-example matrix per task, with 0/1 correctness for each worker on each example
2. Tabularize a worker-task relational model: a comprehensive worker-task matrix of per-task accuracies (e.g., 0.72, 0.59, …), with missing cells for unobserved worker-task pairs
3. Apply MF to infer the missing values
4. Select the best-predicted workers for a target task
23. Matrix Factorization
• Automatically induce latent features
– Task-independent
• Popular due to robustness to sparsity
– SVD is sensitive to matrix density; PMF is much more robust
• Model: R ≈ WᵀT, i.e., R_ij ≈ W_iᵀT_j = Σ_k W_ik T_jk
– R: M workers × N tasks (M >> N); R_ij is, e.g., the rating of user i for movie j
– W ∈ R^(D×M): latent worker features; T ∈ R^(D×N): latent task features
– D = N−1 latent dimensions
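A small illustrative sketch of this factorization, fitting R ≈ WᵀT on observed cells only by gradient descent; the learning rate, regularization, and initialization below are assumed values for illustration, not the paper's exact setup:

```python
import numpy as np

def pmf(R, mask, D=2, lr=0.02, reg=1e-4, n_iter=10000, seed=0):
    """PMF-style factorization: fit R ~ W.T @ T on observed entries only,
    by gradient descent with L2 regularization.
    R: (M workers x N tasks) accuracy matrix; mask: 1 where observed."""
    rng = np.random.default_rng(seed)
    M, N = R.shape
    W = 0.1 * rng.standard_normal((D, M))  # latent worker features
    T = 0.1 * rng.standard_normal((D, N))  # latent task features
    for _ in range(n_iter):
        E = mask * (W.T @ T - R)           # error on observed cells only
        W -= lr * (T @ E.T + reg * W)      # gradient step on worker features
        T -= lr * (W @ E + reg * T)        # gradient step on task features
    return W.T @ T                          # dense predictions, incl. missing
```

Fitting only the masked (observed) cells is what gives the method its robustness to sparsity: unobserved worker-task cells contribute nothing to the loss, yet receive predictions from the learned factors.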
25. Baselines
• Random assignment
– no accuracy prediction; just for task routing
• Simple average
– Average each worker’s accuracy across past tasks
• Weighted average
– weight each task in average by similarity to target task
• task similarity must be estimated from data
26. Estimating Task Similarity
• Define by Pearson correlation over per-task
accuracies of workers who perform both
– Ignore any workers doing only one of the tasks
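As a sketch, this similarity estimate might look like the following; the `task_similarity` helper, its minimum-overlap rule, and the zero fallback for degenerate cases are our own assumptions:

```python
import numpy as np

def task_similarity(acc_a, acc_b):
    """Pearson correlation between two tasks' per-worker accuracies,
    computed only over workers who performed both tasks.
    acc_a, acc_b: dicts mapping worker id -> accuracy on that task."""
    shared = sorted(set(acc_a) & set(acc_b))
    if len(shared) < 2:
        return 0.0  # not enough overlap to estimate similarity
    x = np.array([acc_a[w] for w in shared])
    y = np.array([acc_b[w] for w in shared])
    if x.std() == 0 or y.std() == 0:
        return 0.0  # correlation undefined for constant accuracies
    return float(np.corrcoef(x, y)[0, 1])
```

Note that worker 4 below is ignored because they performed only one of the two tasks, exactly as the slide describes.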
27. Results – RMSE & Mean Acc. (MTurk data)
• [Figure] RMSE & mean accuracy averaged over tasks, k = 1 to 20 workers
• [Figure] Per-task & average results at k = 10 workers
28. Findings
• How does MF prediction accuracy vary given
task similarity, matrix size, & matrix density?
– Feasible, PMF beats SVD, more data = better…
• MF task routing vs. baselines?
– Much better than random; simple baselines suffice in the
sparsest conditions; MF improves beyond that
29. Open Questions
• Other ways to infer task similarity (e.g. textual)
• Under “Big Data” conditions?
• When integrating target task observations?
• How to better model crowd & spam?
• How to address live task routing challenges?
30. Roadmap
• Scalability Challenges in Evaluating IR Systems
• Benchmarking Statistical Consensus Methods
• Task Routing via Matrix Factorization
• Toward Ethical Crowdsourcing
31. A Few Moral Dilemmas
• A “fair” price for online work in a global economy?
– Is it better to pay nothing (e.g., volunteers, gamification)
rather than pay something small for valuable work?
• Are we obligated to inform people how their
participation / work products will be used?
– If my IRB doesn’t require me to obtain informed consent,
is there some other moral obligation to do so?
• A worker finds his ID posted in a researcher’s online
source code and asks that it be removed. This can’t
be done without recreating the repo, which many
people use. What should be done?
32. Mechanical Turk is Not Anonymous
Matthew Lease, Jessica Hullman, Jeffrey P. Bigham, Michael S. Bernstein, Juho Kim,
Walter S. Lasecki, Saeideh Bakhshi, Tanushree Mitra, and Robert C. Miller.
Online: Social Science Research Network, March 6, 2013
ssrn.com/abstract=2190946
33. Amazon profile page URLs use the same IDs as used on MTurk
• How do we respond when we learn we’ve exposed people to risk?
34. Ethical Crowdsourcing
• Assume researchers have good intentions, and
so issues of gross negligence are rare
– Withholding promised pay after work performed
– Not obtaining or complying with IRB oversight
• Instead, the great challenge is how to recognize our
impacts & take appropriate actions in a complex world
– Educating ourselves takes time & effort
– Failing to educate ourselves could cause harm to others
• How can we strike a reasonable balance between
complete apathy vs. being overly alarmist?
35. CACM August, 2013
Paul Hyman. Communications of the ACM, Vol. 56 No. 8, Pages 19-21, August 2013.
36. ACM Code of Ethics (excerpt)
• Contribute to society and human well-being
• Avoid harm to others
• Be honest and trustworthy
• Be fair and take action not to discriminate
• Respect the privacy of others
• COMPLIANCE WITH THE CODE. As an ACM member I will
– Uphold and promote the principles of this Code
– Treat violations of this code as inconsistent with
membership in the ACM
37. CS2008 Curriculum Update (ACM, IEEE)
There is reasonably wide agreement that this topic of legal, social,
professional and ethical should feature in all computing degrees.
…financial and economic imperatives …Which approaches are less
expensive and is this sensible? With the advent of outsourcing and
off-shoring these matters become more complex and take on new
dimensions …there are often related ethical issues concerning
exploitation… Such matters ought to feature in courses on legal,
ethical and professional practice.
if ethical considerations are covered only in the standalone course and
not “in context,” it will reinforce the false notion that technical processes
are void of ethical issues. Thus it is important that several traditional
courses include modules that analyze ethical considerations in the
context of the technical subject matter … It would be explicitly against
the spirit of the recommendations to have only a standalone course.
38. “Contribute to society and human
well-being; avoid harm to others”
• Do we have a moral obligation to try to ascertain
conditions under which work is performed? Or the
impact we have upon those performing the work?
• Do we feel differently when work is performed by
– Political refugees? Children? Prisoners? Disabled?
• How do we know who is doing the work, or if a
decision to work (for a given price) is freely made?
– Does it matter why someone accepts offered work?
40. Who are
the workers?
• A. Baio, November 2008. The Faces of Mechanical Turk.
• P. Ipeirotis. March 2010. The New Demographics of
Mechanical Turk
• J. Ross, et al. Who are the Crowdworkers? CHI 2010.
41. Some Notable Prior Research
• Silberman, Irani, and Ross (2010)
– “How should we… conceptualize the role of these people
who we ask to power our computing?”
– “abstraction hides detail” – some details may be worth
keeping conspicuously present (Jessica Hullman)
• Irani and Silberman (2013)
– “…AMT helps employers see themselves as builders of
innovative technologies, rather than employers unconcerned
with working conditions.”
– “…human computation currently relies on worker invisibility.”
• Fort, Adda, and Cohen (2011)
– “…opportunities for our community to deliberately value
ethics above cost savings.”
42. Power Asymmetry on MTurk
• Mistakes happen, such as wrongly rejecting work – e.g., error by
new student, software bug, poor instructions, noisy gold, etc.
• How do we balance the harm caused by our mistakes to workers
(our liability) vs. our cost/effort of preventing such mistakes?
43. Task Decomposition
• By minimizing context, greater task efficiency &
accuracy can often be achieved in practice
– e.g. “Can you name who is in this photo?”
• Much research on ways to streamline work
and decompose complex tasks
44. Context & Informed Consent
• Assume we wish to obtain informed consent
• Without context, consent cannot be informed
– Zittrain, Ubiquitous human computing (2008)
45. Independent Contractors vs. Employees
• Wolfson & Lease, ASIS&T’11
• Many platforms classify workers as independent
contractors (piece-work, not hourly)
– Legislators/courts must ultimately decide
• Different work classifications yield different legal
rights/protections & responsibilities
– Domestic vs. international workers
– Employment taxes
– Litigation can both cause or redress harm
• Law aside, to what extent do moral principles
underlying current laws apply to online work?
46. Consequences of Human Computation
as a Panacea where AI Falls Short
• The Googler who Looked at the Worst of the Internet
• Policing the Web’s Lurid Precincts
• Facebook content moderation
• The dirty job of keeping Facebook clean
• Even linguistic annotators report stress &
nightmares from reading news articles!
47. What about Freedom?
• Crowdsourcing vision: empowering freedom
– work whenever you want for whomever you want
• Risk: people compelled to perform work
– Chinese prisoners farming gold online
– Digital sweat shops? Digital slaves?
– We know relatively little today about work conditions
– How might we monitor and mitigate risk/growth of
crowd work inflicting harm to at-risk populations?
– Traction? Human Trafficking at MSR Summit’12
49. Join the conversation!
Crowdwork-ethics, by Six Silberman
http://crowdwork-ethics.wtf.tw
an informal, occasional blog for researchers
interested in ethical issues in crowd work
50. The Future of Crowd Work, CSCW’13
Kittur, Nickerson, Bernstein, Gerber,
Shaw, Zimmerman, Lease, and Horton
51. Additional References
• Irani, Lilly C. The Ideological Work of Microwork. In preparation,
draft available online.
• Adda, Gilles, et al. Crowdsourcing for language resource
development: Critical analysis of amazon mechanical turk
overpowering use. Proceedings of the 5th Language and Technology
Conference (LTC). 2011.
• Adda, Gilles, and Joseph J. Mariani. Economic, Legal and Ethical
analysis of Crowdsourcing for Speech Processing. (2013).
• Harris, Christopher G., and Padmini Srinivasan. Crowdsourcing and
Ethics. Security and Privacy in Social Networks. 67-83. 2013.
• Harris, Christopher G. Dirty Deeds Done Dirt Cheap: A Darker Side
to Crowdsourcing. IEEE 3rd conference on social computing
(socialcom). 2011.
• Horton, John J. The condition of the Turking class: Are online
employers fair and honest?. Economics Letters 111.1 (2011): 10-12.
52. Additional References (2)
• Bederson, B. B., & Quinn, A. J. Web workers unite! addressing challenges
of online laborers. In CHI 2011 Human Computation Workshop, 97-106.
• Bederson, B. B., & Quinn, A. J. Participation in Human Computation. In
CHI 2011 Human Computation Workshop.
• Felstiner, Alek. Working the Crowd: Employment and Labor Law in the
Crowdsourcing Industry. Berkeley J. Employment & Labor Law 32.1 2011
• Felstiner, Alek. Sweatshop or Paper Route?: Child Labor Laws and In-Game Work. CrowdConf (2010).
• Larson, Martha. Toward Responsible and Sustainable Crowdsourcing.
Blog post + Slides from Dagstuhl, September 2013.
• Vili Lehdonvirta and Paul Mezier. Identity and Self-Organization in
Unstructured Work. Unpublished working paper. 16 October 2013.
• Zittrain, Jonathan. Minds for Sale. YouTube.
53. Thank You!
See also: SIAM’13 Tutorial
Slides: www.slideshare.net/mattlease
ir.ischool.utexas.edu