Crowdsourcing for Information Retrieval: From Statistics to Ethics
1. Crowdsourcing for Information Retrieval:
From Statistics to Ethics
Matt Lease
School of Information
University of Texas at Austin
@mattlease
ml@utexas.edu
2. Roadmap
• Scalability Challenges in IR Evaluation (brief)
• Benchmarking Statistical Consensus Methods
• Task Routing via Matrix Factorization
• Toward Ethical Crowdsourcing
Matt Lease <ml@utexas.edu>
2
4. Why Evaluation at Scale?
• Evaluation should closely
mirror real use conditions
• The best algorithm at
small scale may not be
best at larger scales
– Banko and Brill (2001)
– Halevy et al. (2009)
• IR systems should be evaluated on the scale of
data which users will search in practice
5. Why is Evaluation at Scale Hard?
• Multiple ways to evaluate; consider Cranfield
– Given a document collection and set of user queries
– Label documents for relevance to each query
– Evaluate search algorithms on these queries & documents
• Labeling data is slow/expensive/difficult
• Approach 1: label less data (e.g. active learning)
– Pooling, metrics robust to sparse data (e.g., BPref)
– Measure only relative performance (e.g., statAP, MTC)
• Approach 2: label data more efficiently
– Crowdsourcing (e.g., Amazon’s Mechanical Turk)
7. Crowdsourcing for IR Evaluation
• Origin: Alonso et al. (SIGIR Forum 2008)
– Continuing active area of research
• Primary concern: ensuring reliable data
– Reliable data provides the foundation for evaluation
– If QA is inefficient, its overhead could erase any savings
– Common strategy: ask multiple people to judge
relevance, then aggregate their answers (consensus)
8. Roadmap
• Scalability Challenges in Evaluating IR Systems
• Benchmarking Statistical Consensus Methods
• Task Routing via Matrix Factorization
• Toward Ethical Crowdsourcing
9. SQUARE: A Benchmark for Research
on Computing Crowd Consensus
Aashish Sheshadri and M. Lease, HCOMP’13
ir.ischool.utexas.edu/square (open source)
10. Background
• How do we resolve disagreement among multiple
people’s answers to arrive at consensus?
• Simple baseline: majority voting
• Long history pre-dating crowdsourcing
– Dawid and Skene’79, Smyth et al., ’95
– Recent focus on quality assurance with crowds
• Many more methods, active research topic
– Across many areas: ML, Vision, NLP, IR, DB, …
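To make the simple baseline concrete, here is an illustrative sketch of majority voting; the `majority_vote` helper and its tie-breaking behavior are our own, not from any cited paper:

```python
from collections import Counter

def majority_vote(labels_by_item):
    """Pick the most frequent label per item. Ties break arbitrarily
    by first-seen order (a simplification worth noting)."""
    return {item: Counter(labels).most_common(1)[0][0]
            for item, labels in labels_by_item.items()}

# Three workers judge the relevance of two documents:
votes = {"d1": [1, 1, 0], "d2": [0, 0, 1]}
print(majority_vote(votes))  # {'d1': 1, 'd2': 0}
```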
11. Why Benchmark?
• Drive field innovation by clear challenge tasks
– e.g., David Tse’s FIST 2012 Keynote (Comp. Biology)
• Many other things we can learn
– How do methods compare?
• Qualitatively & quantitatively?
– What is the state-of-the-art today?
– What works, what doesn’t, and why?
• Where is further research most needed?
– How has field progressed over time?
12. Consensus Methods Compared: Pros & Cons
• Method = Model + Training + Inference; ordered simplest to most complex
• MV (Majority Vote)
– Pros: simple, fast, no training; task-independent
– Cons: most limited model; cannot be supervised; no confusion matrix
• ZC (Demartini’12): worker reliability parameters
– Pros: task-independent; can be supervised; allows priors on worker reliability & class distribution
– Cons: no confusion matrix; classification only
• GLAD (Whitehill et al.’09): worker reliability & task difficulty parameters
– Pros: task-independent; can be supervised; prior on class distribution
– Cons: no confusion matrix; no worker priors; classification only
• Naïve Bayes (NB, Snow et al.’08): = D&S model, fully supervised
– Pros: supports multi-class tasks; models worker confusion; simple maximum likelihood
– Cons: no worker priors; classification only; space proportional to number of classes
• Dawid & Skene’79 (DS): class priors & worker confusion matrices
– Pros: supports multi-class tasks; models worker confusion; unsupervised, semi-supervised, or fully supervised
– Cons: no worker priors; classification only; space proportional to number of classes
• Raykar et al.’10 (RY): worker confusion, sensitivity, specificity; (optional) automatic classifier
– Pros: classifier not required; priors on worker confusion and class distribution; multi-class support; can be supervised
– Cons: automatic classifier requires feature representation; classification only; space proportional to number of classes
• Welinder et al.’10 (CUBAM): worker reliability and confusion, annotation noise, task difficulty
– Pros: detailed model of the annotation process; can identify worker clusters; multi-class support
– Cons: complex, with many hyper-parameters; unclear how to supervise
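Since confusion-matrix models like DS come up repeatedly here, a minimal unsupervised EM sketch in the spirit of Dawid & Skene (1979) may help; the smoothing constant, initialization, and variable names below are our own illustrative choices, not the original paper's:

```python
import numpy as np

def dawid_skene(labels, n_classes, n_iter=50):
    """Minimal unsupervised Dawid & Skene-style EM.
    labels: list of (item, worker, label) triples; labels are ints 0..K-1."""
    items = sorted({i for i, _, _ in labels})
    workers = sorted({w for _, w, _ in labels})
    I, W, K = len(items), len(workers), n_classes
    item_idx = {v: n for n, v in enumerate(items)}
    worker_idx = {v: n for n, v in enumerate(workers)}

    # Initialize posterior over true labels from empirical vote fractions.
    T = np.zeros((I, K))
    for i, w, l in labels:
        T[item_idx[i], l] += 1
    T /= T.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: class priors and per-worker confusion matrices,
        # conf[w, true_class, observed_label], with small smoothing.
        prior = T.sum(axis=0) / I
        conf = np.full((W, K, K), 1e-6)
        for i, w, l in labels:
            conf[worker_idx[w], :, l] += T[item_idx[i]]
        conf /= conf.sum(axis=2, keepdims=True)

        # E-step: posterior over each item's true label.
        logT = np.tile(np.log(prior), (I, 1))
        for i, w, l in labels:
            logT[item_idx[i]] += np.log(conf[worker_idx[w], :, l])
        T = np.exp(logT - logT.max(axis=1, keepdims=True))
        T /= T.sum(axis=1, keepdims=True)

    return {items[n]: int(T[n].argmax()) for n in range(I)}
```

Unlike majority voting, this model can recover the truth even when a consistent minority of workers is adversarial, since their confusion matrices are learned.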
16. Findings
• Majority voting never best, rarely much worse
• Each method is often best for some condition
– e.g., the original dataset it was designed for
• DS & RY tend to perform best (RY adds priors)
• No method performs far beyond others
– Of course, contributions aren’t just empirical…
17. Why Don’t We See Bigger Gains?
• Gold is too noisy to detect improvement?
– Cormack & Kolcz’09, Klebanov & Beigman’10
• Limited tasks / scenarios considered?
– e.g., we exclude hybrid methods & worker filtering
• Might we see greater differences from
– Better benchmark tests?
– Better tuning of methods?
– Additional methods?
• We invite community contributions!
18. Roadmap
• Scalability Challenges in Evaluating IR Systems
• Benchmarking Statistical Consensus Methods
• Task Routing via Matrix Factorization
• Toward Ethical Crowdsourcing
19. Crowdsourced Task Routing via
Matrix Factorization
HyunJoon Jung and M. Lease
arXiv 1310.5142, under review
21. Task Routing: Background
• Selection vs. recommendation vs. assignment
– Potential to improve work quality & satisfaction
– Task search time adds latency & is uncompensated
– Tradeoffs in push vs. pull, varying models
• Many matching criteria one could consider
– Preferences, Experience, Skills, Job constraints, …
• References
– Law and von Ahn, 2011 (Ch. 4)
– Chilton et al., 2010
• MTurk “free” selection constrained by search interface
22. Matrix Factorization Approach
• Collaborative filtering-based recommendation
• Intuition: a worker achieves similar accuracy on similar tasks
– The notion is more general: e.g., preference, expertise, etc.
• Approach, step by step:
1. Accumulate repeated crowdsourced data: a worker-example matrix per task, with 0/1 correctness for each worker on each example
2. Tabularize a worker-task relational model: a comprehensive worker-task matrix of per-task accuracies (e.g., 0.72, 0.59, …), with missing cells for unobserved worker-task pairs
3. Apply MF to infer the missing values
4. Select the best-predicted workers for a target task
23. Matrix Factorization
• Automatically induce latent features
– Task-independent
• Popular due to robustness to sparsity
– SVD is sensitive to matrix density; PMF is much more robust
• Model: R ≈ WᵀT, i.e., R_ij ≈ W_iᵀT_j = Σ_k W_ik T_jk
– R: M workers × N tasks (M >> N); R_ij is, e.g., the rating of user i for movie j
– W ∈ R^(D×M): latent worker features; T ∈ R^(D×N): latent task features
– D = N−1 latent dimensions
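A small illustrative sketch of this factorization, fitting R ≈ WᵀT on observed cells only by gradient descent; the learning rate, regularization, and initialization below are assumed values for illustration, not the paper's exact setup:

```python
import numpy as np

def pmf(R, mask, D=2, lr=0.02, reg=1e-4, n_iter=10000, seed=0):
    """PMF-style factorization: fit R ~ W.T @ T on observed entries only,
    by gradient descent with L2 regularization.
    R: (M workers x N tasks) accuracy matrix; mask: 1 where observed."""
    rng = np.random.default_rng(seed)
    M, N = R.shape
    W = 0.1 * rng.standard_normal((D, M))  # latent worker features
    T = 0.1 * rng.standard_normal((D, N))  # latent task features
    for _ in range(n_iter):
        E = mask * (W.T @ T - R)           # error on observed cells only
        W -= lr * (T @ E.T + reg * W)      # gradient step on worker features
        T -= lr * (W @ E + reg * T)        # gradient step on task features
    return W.T @ T                          # dense predictions, incl. missing
```

Fitting only the masked (observed) cells is what gives the method its robustness to sparsity: unobserved worker-task cells contribute nothing to the loss, yet receive predictions from the learned factors.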
25. Baselines
• Random assignment
– no accuracy prediction; just for task routing
• Simple average
– Average each worker’s accuracy across past tasks
• Weighted average
– weight each task in average by similarity to target task
• task similarity must be estimated from data
26. Estimating Task Similarity
• Define by Pearson correlation over per-task
accuracies of workers who perform both
– Ignore any workers doing only one of the tasks
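As a sketch, this similarity estimate might look like the following; the `task_similarity` helper, its minimum-overlap rule, and the zero fallback for degenerate cases are our own assumptions:

```python
import numpy as np

def task_similarity(acc_a, acc_b):
    """Pearson correlation between two tasks' per-worker accuracies,
    computed only over workers who performed both tasks.
    acc_a, acc_b: dicts mapping worker id -> accuracy on that task."""
    shared = sorted(set(acc_a) & set(acc_b))
    if len(shared) < 2:
        return 0.0  # not enough overlap to estimate similarity
    x = np.array([acc_a[w] for w in shared])
    y = np.array([acc_b[w] for w in shared])
    if x.std() == 0 or y.std() == 0:
        return 0.0  # correlation undefined for constant accuracies
    return float(np.corrcoef(x, y)[0, 1])
```

Note that worker 4 below is ignored because they performed only one of the two tasks, exactly as the slide describes.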
27. Results – RMSE & Mean Acc. (MTurk data)
• [Figure] RMSE & mean accuracy averaged over tasks, k = 1 to 20 workers
• [Figure] Per-task & average results at k = 10 workers
28. Findings
• How does MF prediction accuracy vary given
task similarity, matrix size, & matrix density?
– Feasible, PMF beats SVD, more data = better…
• MF task routing vs. baselines?
– Much better than random; simple baselines suffice in the
sparsest conditions; MF improves beyond that
29. Open Questions
• Other ways to infer task similarity (e.g. textual)
• Under “Big Data” conditions?
• When integrating target task observations?
• How to better model crowd & spam?
• How to address live task routing challenges?
30. Roadmap
• Scalability Challenges in Evaluating IR Systems
• Benchmarking Statistical Consensus Methods
• Task Routing via Matrix Factorization
• Toward Ethical Crowdsourcing
31. A Few Moral Dilemmas
• A “fair” price for online work in a global economy?
– Is it better to pay nothing (e.g., volunteers, gamification)
rather than pay something small for valuable work?
• Are we obligated to inform people how their
participation / work products will be used?
– If my IRB doesn’t require me to obtain informed consent,
is there some other moral obligation to do so?
• A worker finds his ID posted in a researcher’s online
source code and asks that it be removed. This can’t
be done without recreating the repo, which many
people use. What should be done?
32. Mechanical Turk is Not Anonymous
Matthew Lease, Jessica Hullman, Jeffrey P. Bigham, Michael S. Bernstein, Juho Kim,
Walter S. Lasecki, Saeideh Bakhshi, Tanushree Mitra, and Robert C. Miller.
Online: Social Science Research Network, March 6, 2013
ssrn.com/abstract=2190946
33. Amazon profile page URLs use the same IDs as used on MTurk
• How do we respond when we learn we’ve exposed people to risk?
34. Ethical Crowdsourcing
• Assume researchers have good intentions, and
so issues of gross negligence are rare
– Withholding promised pay after work performed
– Not obtaining or complying with IRB oversight
• Instead, the great challenge is how to recognize our
impacts & take appropriate actions in a complex world
– Educating ourselves takes time & effort
– Failing to educate ourselves could cause harm to others
• How can we strike a reasonable balance between
complete apathy vs. being overly alarmist?
35. CACM August, 2013
Paul Hyman. Communications of the ACM, Vol. 56 No. 8, Pages 19-21, August 2013.
36. ACM Code of Ethics (excerpt)
• Contribute to society and human well-being
• Avoid harm to others
• Be honest and trustworthy
• Be fair and take action not to discriminate
• Respect the privacy of others
• COMPLIANCE WITH THE CODE. As an ACM member I will
– Uphold and promote the principles of this Code
– Treat violations of this code as inconsistent with
membership in the ACM
37. CS2008 Curriculum Update (ACM, IEEE)
There is reasonably wide agreement that this topic of legal, social,
professional and ethical should feature in all computing degrees.
…financial and economic imperatives …Which approaches are less
expensive and is this sensible? With the advent of outsourcing and
off-shoring these matters become more complex and take on new
dimensions …there are often related ethical issues concerning
exploitation… Such matters ought to feature in courses on legal,
ethical and professional practice.
if ethical considerations are covered only in the standalone course and
not “in context,” it will reinforce the false notion that technical processes
are void of ethical issues. Thus it is important that several traditional
courses include modules that analyze ethical considerations in the
context of the technical subject matter … It would be explicitly against
the spirit of the recommendations to have only a standalone course.
38. “Contribute to society and human
well-being; avoid harm to others”
• Do we have a moral obligation to try to ascertain
conditions under which work is performed? Or the
impact we have upon those performing the work?
• Do we feel differently when work is performed by
– Political refugees? Children? Prisoners? Disabled?
• How do we know who is doing the work, or if a
decision to work (for a given price) is freely made?
– Does it matter why someone accepts offered work?
40. Who are
the workers?
• A. Baio, November 2008. The Faces of Mechanical Turk.
• P. Ipeirotis. March 2010. The New Demographics of
Mechanical Turk
• J. Ross, et al. Who are the Crowdworkers? CHI 2010.
41. Some Notable Prior Research
• Silberman, Irani, and Ross (2010)
– “How should we… conceptualize the role of these people
who we ask to power our computing?”
– “abstraction hides detail” – some details may be worth
keeping conspicuously present (Jessica Hullman)
• Irani and Silberman (2013)
– “…AMT helps employers see themselves as builders of
innovative technologies, rather than employers unconcerned
with working conditions.”
– “…human computation currently relies on worker invisibility.”
• Fort, Adda, and Cohen (2011)
– “…opportunities for our community to deliberately value
ethics above cost savings.”
42. Power Asymmetry on MTurk
• Mistakes happen, such as wrongly rejecting work – e.g., error by
new student, software bug, poor instructions, noisy gold, etc.
• How do we balance the harm caused by our mistakes to workers
(our liability) vs. our cost/effort of preventing such mistakes?
43. Task Decomposition
• By minimizing context, greater task efficiency &
accuracy can often be achieved in practice
– e.g. “Can you name who is in this photo?”
• Much research on ways to streamline work
and decompose complex tasks
44. Context & Informed Consent
• Assume we wish to obtain informed consent
• Without context, consent cannot be informed
– Zittrain, Ubiquitous human computing (2008)
45. Independent Contractors vs. Employees
• Wolfson & Lease, ASIS&T’11
• Many platforms classify workers as independent
contractors (piece-work, not hourly)
– Legislators/courts must ultimately decide
• Different work classifications yield different legal
rights/protections & responsibilities
– Domestic vs. international workers
– Employment taxes
– Litigation can both cause or redress harm
• Law aside, to what extent do moral principles
underlying current laws apply to online work?
46. Consequences of Human Computation
as a Panacea where AI Falls Short
• The Googler who Looked at the Worst of the Internet
• Policing the Web’s Lurid Precincts
• Facebook content moderation
• The dirty job of keeping Facebook clean
• Even linguistic annotators report stress &
nightmares from reading news articles!
47. What about Freedom?
• Crowdsourcing vision: empowering freedom
– work whenever you want for whomever you want
• Risk: people compelled to perform work
– Chinese prisoners farming gold online
– Digital sweat shops? Digital slaves?
– We know relatively little today about work conditions
– How might we monitor and mitigate risk/growth of
crowd work inflicting harm to at-risk populations?
– Traction? Human Trafficking at MSR Summit’12
49. Join the conversation!
Crowdwork-ethics, by Six Silberman
http://crowdwork-ethics.wtf.tw
an informal, occasional blog for researchers
interested in ethical issues in crowd work
50. The Future of Crowd Work, CSCW’13
Kittur, Nickerson, Bernstein, Gerber,
Shaw, Zimmerman, Lease, and Horton
51. Additional References
• Irani, Lilly C. The Ideological Work of Microwork. In preparation,
draft available online.
• Adda, Gilles, et al. Crowdsourcing for language resource
development: Critical analysis of amazon mechanical turk
overpowering use. Proceedings of the 5th Language and Technology
Conference (LTC). 2011.
• Adda, Gilles, and Joseph J. Mariani. Economic, Legal and Ethical
analysis of Crowdsourcing for Speech Processing. (2013).
• Harris, Christopher G., and Padmini Srinivasan. Crowdsourcing and
Ethics. Security and Privacy in Social Networks. 67-83. 2013.
• Harris, Christopher G. Dirty Deeds Done Dirt Cheap: A Darker Side
to Crowdsourcing. IEEE 3rd conference on social computing
(socialcom). 2011.
• Horton, John J. The condition of the Turking class: Are online
employers fair and honest?. Economics Letters 111.1 (2011): 10-12.
52. Additional References (2)
• Bederson, B. B., & Quinn, A. J. Web workers unite! addressing challenges
of online laborers. In CHI 2011 Human Computation Workshop, 97-106.
• Bederson, B. B., & Quinn, A. J. Participation in Human Computation. In
CHI 2011 Human Computation Workshop.
• Felstiner, Alek. Working the Crowd: Employment and Labor Law in the
Crowdsourcing Industry. Berkeley J. Employment & Labor Law 32.1 2011
• Felstiner, Alek. Sweatshop or Paper Route?: Child Labor Laws and In-Game Work. CrowdConf (2010).
• Larson, Martha. Toward Responsible and Sustainable Crowdsourcing.
Blog post + Slides from Dagstuhl, September 2013.
• Vili Lehdonvirta and Paul Mezier. Identity and Self-Organization in
Unstructured Work. Unpublished working paper. 16 October 2013.
• Zittrain, Jonathan. Minds for Sale. YouTube.
53. Thank You!
See also: SIAM’13 Tutorial
Slides: www.slideshare.net/mattlease
ir.ischool.utexas.edu