Crowdsourcing for Research & Engineering
                                                                 Omar Alonso
                                                                 Microsoft

                                                                 Matthew Lease
                                                                 University of Texas at Austin

                                                                 November 1, 2011




 November 1, 2011   Crowdsourcing for Research and Engineering                         1
Tutorial Objectives
• What is crowdsourcing?
• How and when to use crowdsourcing?
• How to use Mechanical Turk
• Experimental setup and design guidelines for
  working with the crowd
• Quality control: issues, measuring, and improving
• Future trends
• Research landscape and open challenges

November 1, 2011      Crowdsourcing for Research and Engineering   2
Tutorial Outline
I. Introduction to Crowdsourcing
     I. Introduction, Examples, Terminology
     II. Primary focus on micro-tasks
II. Tools and Platforms: APIs and Examples




III. Methodology for effective crowdsourcing
     I. Methods, Examples, and Tips
     II. Quality control: monitoring & improving
IV. Future Trends
November 1, 2011     Crowdsourcing for Research and Engineering   3
I
INTRODUCTION TO CROWDSOURCING

November 1, 2011   Crowdsourcing for Research and Engineering   4
From Outsourcing to Crowdsourcing
• Take a job traditionally
  performed by a known agent
  (often an employee)
• Outsource it to an undefined,
  generally large group of
  people via an open call
• New application of principles
  from open source movement
• Evolving & broadly defined ...
 November 1, 2011   Crowdsourcing for Research and Engineering   5
Examples




Crowdsourcing 101: Putting the WSDM of Crowds to Work for You.   6
November 1, 2011   Crowdsourcing for Research and Engineering   7
Crowdsourcing models
•   Virtual work, Micro-tasks, & Aggregators
•   Open Innovation, Co-Creation, & Contests
•   Citizen Science
•   Prediction Markets
•   Crowd Funding and Charity
•   “Gamification” (not serious gaming)
•   Transparent
•   cQ&A, Social Search, and Polling
•   Human Sensing
November 1, 2011       Crowdsourcing for Research and Engineering   8
What is Crowdsourcing?
• A collection of mechanisms and associated
  methodologies for scaling and directing crowd
  activities to achieve some goal(s)
• Enabled by internet-connectivity
• Many related areas
      –   Collective intelligence
      –   Social computing
      –   People services
      –   Human computation (next slide…)
• Good work is creative, innovative, surprising, …
November 1, 2011        Crowdsourcing for Research and Engineering   9
Human Computation
• Having people do stuff instead of computers
• Investigates use of people to execute certain
  computations for which capabilities of current
  automated methods are more limited
• Explores the metaphor of computation for
  characterizing attributes, capabilities, and
  limitations of human performance in executing
  desired tasks
• Computation is required, crowd is not
• Pioneer: Luis von Ahn’s thesis (2005)
November 1, 2011      Crowdsourcing for Research and Engineering   10
What is not crowdsourcing?
• Ingredients necessary but not sufficient
      – A crowd
      – Digital communication
• Post-hoc use of undirected crowd behaviors
      – e.g. Data mining, visualization
• Conducting a traditional survey or poll
• Human Computation with one or few people
      – E.g. traditional active learning
• …

November 1, 2011       Crowdsourcing for Research and Engineering   11
Crowdsourcing Key Questions
• What are the goals?
      – Purposeful directing of human activity

• How can you incentivize participation?
      – Incentive engineering
      – Who are the target participants?

• Which model(s) are most appropriate?
      – How to adapt them to your context and goals?
November 1, 2011    Crowdsourcing for Research and Engineering   12
What do you want to accomplish?
• Perform specified task(s)
• Innovate and/or discover
• Create
• Predict
• Fund
• Learn
• Monitor
November 1, 2011   Crowdsourcing for Research and Engineering
Why Should Anyone Participate?




Don’t let this happen to you …
November 1, 2011   Crowdsourcing for Research and Engineering   14
Incentive Engineering
• Earn Money (real or virtual)
• Have fun (or pass the time)
• Socialize with others
• Obtain recognition or prestige (leaderboards, badges)
• Do Good (altruism)
• Learn something new
• Obtain something else
• Create self-serving resource

Multiple incentives can often operate in parallel (*caveat)
November 1, 2011       Crowdsourcing for Research and Engineering   15
Models: Goal(s) + Incentives
•   Virtual work, Micro-tasks, & Aggregators
•   Open Innovation, Co-Creation, & Contests
•   Citizen Science
•   Prediction Markets
•   Crowd Funding and Charity
•   “Gamification” (not serious gaming)
•   Transparent
•   cQ&A, Social Search, and Polling
•   Human Sensing
November 1, 2011   Crowdsourcing for Research and Engineering   16
Example: Wikipedia
• Earn Money (real or virtual)
• Have fun (or pass the time)
• Socialize with others
• Obtain recognition or prestige
• Do Good (altruism)
• Learn something new
• Obtain something else
• Create self-serving resource

November 1, 2011      Crowdsourcing for Research and Engineering   17
Example:
• Earn Money (real or virtual)
• Have fun (or pass the time)
• Socialize with others
• Obtain recognition or prestige
• Do Good (altruism)
• Learn something new
• Obtain something else
• Create self-serving resource

November 1, 2011    Crowdsourcing for Research and Engineering   18
Example: ESP and GWaP
L. Von Ahn and L. Dabbish (2004)




November 1, 2011        Crowdsourcing for Research and Engineering   19
Example: ESP
• Earn Money (real or virtual)
• Have fun (or pass the time)
• Socialize with others
• Obtain recognition or prestige
• Do Good (altruism)
• Learn something new
• Obtain something else
• Create self-serving resource

November 1, 2011   Crowdsourcing for Research and Engineering   20
Example: fold.it
S. Cooper et al. (2010)




Alice G. Walton. Online Gamers Help Solve Mystery of
Critical AIDS Virus Enzyme. The Atlantic, October 8, 2011.
November 1, 2011    Crowdsourcing for Research and Engineering   21
Example: fold.it
• Earn Money (real or virtual)
• Have fun (or pass the time)
• Socialize with others
• Obtain recognition or prestige
• Do Good (altruism)
• Learn something new
• Obtain something else
• Create self-serving resource

November 1, 2011    Crowdsourcing for Research and Engineering   22
Example: FreeRice




November 1, 2011     Crowdsourcing for Research and Engineering   23
Example: FreeRice
• Earn Money (real or virtual)
• Have fun (or pass the time)
• Socialize with others
• Obtain recognition or prestige
• Do Good (altruism)
• Learn something new
• Obtain something else
• Create self-serving resource

November 1, 2011     Crowdsourcing for Research and Engineering   24
Example: cQ&A, Social Search, & Polling




November 1, 2011   Crowdsourcing for Research and Engineering   25
Example: cQ&A
• Earn Money (real or virtual)
• Have fun (or pass the time)
• Socialize with others
• Obtain recognition or prestige
• Do Good (altruism)
• Learn something new
• Obtain something else
• Create self-serving resource

November 1, 2011    Crowdsourcing for Research and Engineering   26
Example: reCaptcha




November 1, 2011      Crowdsourcing for Research and Engineering   27
Example: reCaptcha
• Earn Money (real or virtual)
• Have fun (or pass the time)
• Socialize with others
• Obtain recognition or prestige
• Do Good (altruism)
• Learn something new
• Obtain something else
• Create self-serving resource

Is there an existing human activity you can harness for another purpose?

November 1, 2011      Crowdsourcing for Research and Engineering                           28
Example: Mechanical Turk




         J. Pontin. Artificial Intelligence, With Help From
         the Humans. New York Times (March 25, 2007)
 November 1, 2011      Crowdsourcing for Research and Engineering   29
Example: Mechanical Turk
• Earn Money (real or virtual)
• Have fun (or pass the time)
• Socialize with others
• Obtain recognition or prestige
• Do Good (altruism)
• Learn something new
• Obtain something else
• Create self-serving resource

November 1, 2011   Crowdsourcing for Research and Engineering   30
Look Before You Leap
• Wolfson & Lease (2011)
• Identify a few potential legal pitfalls to know about
  when considering crowdsourcing
   –   employment law
   –   patent inventorship
   –   data security and the Federal Trade Commission
   –   copyright ownership
   –   securities regulation of crowdfunding
• Take-away: don’t panic, just be mindful of the law
 November 1, 2011        Crowdsourcing for Research and Engineering   31
Example: SamaSource




 Incentive for YOU: Do Good
 Terminology: channels
November 1, 2011       Crowdsourcing for Research and Engineering   32
Who are
the workers?


• A. Baio, November 2008. The Faces of Mechanical Turk.
• P. Ipeirotis. March 2010. The New Demographics of
  Mechanical Turk
• J. Ross, et al. Who are the Crowdworkers?... CHI 2010.
 November 1, 2011   Crowdsourcing for Research and Engineering   33
MTurk Demographics
• 2008-2009 studies found
  less global and diverse
  than previously thought
      – US
      – Female
      – Educated
      – Bored
      – Money is secondary

November 1, 2011      Crowdsourcing for Research and Engineering   34
2010 shows increasing diversity
47% US, 34% India, 19% other (P. Ipeirotis, March 2010)




 November 1, 2011   Crowdsourcing for Research and Engineering   35
MICRO-TASKS++ AND EXAMPLES

November 1, 2011        Crowdsourcing for Research and Engineering   36
Chess machine unveiled in 1770 by Wolfgang von Kempelen (1734–1804)

•     “Micro-task” crowdsourcing marketplace
•     On-demand, scalable, real-time workforce
•     Online since 2005 (and still in “beta”)
•     Programmer’s API & “Dashboard” GUI
•     Sponsorship: TREC 2011 Crowdsourcing Track (pending)
    November 1, 2011            Crowdsourcing for Research and Engineering           37
Does anyone really use it? Yes!




   http://www.mturk-tracker.com (P. Ipeirotis’10)

From 1/09 – 4/10, 7M HITs from 10K requestors
worth $500,000 USD (significant under-estimate)
 November 1, 2011   Crowdsourcing for Research and Engineering   38
CrowdFlower
• Labor on-demand, channels, quality control features
• Sponsorship
   – Research Workshops: CSE’10, CSDM’11, CIR’11,
   – TREC 2011 Crowdsourcing Track




   November 1, 2011   Crowdsourcing for Research and Engineering   39
CloudFactory
•   Information below from Mark Sears (Oct. 18, 2011)
•   Cloud Labor API
      – Tools to design virtual assembly lines
      – workflows with multiple tasks chained together
•   Focus on self-serve tools that let people easily design crowd-powered assembly lines
    and integrate them into software applications
•   Interfaces: command-line, RESTful API, and Web
•   Each “task station” can have either a human or robot worker assigned
      – web software services (AlchemyAPI, SendGrid, Google APIs, Twilio, etc.) or local software can
        be combined with human computation
•   Many built-in "best practices"
      – “Tournament Stations” where multiple results are compared by other cloud workers until
        confidence in the best answer is reached
      – “Improver Stations” have workers improve and correct work by other workers
      – Badges are earned by cloud workers passing tests created by requesters
      – Training and tools to create skill tests will be flexible
      – Algorithms to detect and kick out spammers/cheaters/lazy/bad workers
•   Sponsorship: TREC 2012 Crowdsourcing Track
November 1, 2011                 Crowdsourcing for Research and Engineering                        40
More Crowd Labor Platforms
•    Clickworker
•    CloudCrowd
•    CrowdSource
•    DoMyStuff
•    Humanoid (by Matt Swason et al.)
•    Microtask
•    MobileWorks (by Anand Kulkarni )
•    myGengo
•    SmartSheet
•    vWorker
•    Industry heavy-weights
       –   Elance
       –   Liveops
       –   oDesk
       –   uTest
•    and more…

    November 1, 2011       Crowdsourcing for Research and Engineering   41
Why Micro-Tasks?
• Easy, cheap and fast
• Ready-to use infrastructure, e.g.
      – MTurk payments, workforce, interface widgets
      – CrowdFlower quality control mechanisms, etc.
      – Many others …
• Allows early, iterative, frequent trials
      – Iteratively prototype and test new ideas
      – Try new tasks, test when you want & as you go
• Many successful examples of use reported
November 1, 2011     Crowdsourcing for Research and Engineering   42
Micro-Task Issues
• Process
      – Task design, instructions, setup, iteration
• Choose crowdsourcing platform (or roll your own)
• Human factors
      – Payment / incentives, interface and interaction design,
        communication, reputation, recruitment, retention
• Quality Control / Data Quality
      – Trust, reliability, spam detection, consensus labeling


November 1, 2011      Crowdsourcing for Research and Engineering   43
Legal Disclaimer:
             Caution Tape and Silver Bullets




• Often still involves more art than science
• Not a panacea, but another alternative
   – one more data point for analysis, complements other methods
• Quality may be traded off for time/cost/effort
• Hard work & experimental design still required!
 November 1, 2011    Crowdsourcing for Research and Engineering   44
Hello World Demo
• We’ll show a simple, short demo of MTurk
• This is a teaser highlighting things we’ll discuss
      – Don’t worry about details; we’ll revisit them
• Specific task unimportant
• Big idea: easy, fast, cheap to label with MTurk!




November 1, 2011     Crowdsourcing for Research and Engineering   45
Jane saw the man with the binoculars




November 1, 2011   Crowdsourcing for Research and Engineering   46
DEMO


November 1, 2011   Crowdsourcing for Research and Engineering   47
Traditional Data Collection
• Setup data collection software / harness
• Recruit participants
• Pay a flat fee for experiment or hourly wage

• Characteristics
      –   Slow
      –   Expensive
      –   Tedious
      –   Sample Bias

November 1, 2011        Crowdsourcing for Research and Engineering   48
Research Using Micro-Tasks
• Let’s see examples of micro-task usage
      – Many areas: IR, NLP, computer vision, user studies,
        usability testing, psychological studies, surveys, …


• Check bibliography at end for more references




November 1, 2011     Crowdsourcing for Research and Engineering   49
NLP Example – Dialect Identification




November 1, 2011   Crowdsourcing for Research and Engineering   50
NLP Example – Spelling correction




November 1, 2011   Crowdsourcing for Research and Engineering   51
NLP Example – Machine Translation
• Manual evaluation of translation quality is
  slow and expensive
• High agreement between non-experts and
  experts
• $0.10 to translate a sentence


   C. Callison-Burch. “Fast, Cheap, and Creative: Evaluating Translation Quality Using Amazon’s Mechanical Turk”, EMNLP 2009.

   B. Bederson et al. Translation by Interactive Collaboration between Monolingual Users, GI 2010



November 1, 2011                        Crowdsourcing for Research and Engineering                                          52
NLP Example – Snow et al. (2008)

• 5 Tasks
      –   Affect recognition
      –   Word similarity
      –   Recognizing textual entailment
      –   Event temporal ordering
      –   Word sense disambiguation
• high agreement between crowd
  labels and expert “gold” labels
      – assumes training data for worker bias correction
• 22K labels for $26!
November 1, 2011       Crowdsourcing for Research and Engineering   53
Computer Vision – Painting Similarity




                                       Kovashka & Lease, CrowdConf’10

November 1, 2011   Crowdsourcing for Research and Engineering           54
User Studies
• Investigate attitudes about saving, sharing, publishing,
  and removing online photos
• Survey
      – A scenario-based probe of respondent attitudes, designed
        to yield quantitative data
      – A set of questions (closed- and open-ended)
      – Importance of recent activity
      – 41 questions
      – 7-point scale
• 250 respondents

   C. Marshall and F. Shipman. “The Ownership and Reuse of Visual Media”, JCDL 2011.


November 1, 2011                        Crowdsourcing for Research and Engineering     55
Remote Usability Testing
• Liu et al. (in preparation)
• Compares remote usability testing using MTurk and
  CrowdFlower (not uTest) vs. traditional on-site testing
• Advantages
      –   More Participants
      –   More Diverse Participants
      –   High Speed
      –   Low Cost
• Disadvantages
      –   Lower Quality Feedback
      –   Less Interaction
      –   Greater need for quality control
      –   Less Focused User Groups
November 1, 2011          Crowdsourcing for Research and Engineering   56
IR Example – Relevance and ads




November 1, 2011   Crowdsourcing for Research and Engineering   57
IR Example – Product Search




November 1, 2011   Crowdsourcing for Research and Engineering   58
IR Example – Snippet Evaluation
•    Study on summary lengths
•    Determine preferred result length
•    Asked workers to categorize web queries
•    Asked workers to evaluate snippet quality
•    Payment between $0.01 and $0.05 per HIT


    M. Kaisser, M. Hearst, and L. Lowe. “Improving Search Results Quality by Customizing Summary Lengths”, ACL/HLT, 2008.




November 1, 2011                         Crowdsourcing for Research and Engineering                                         59
IR Example – Relevance Assessment
•     Replace TREC-like relevance assessors with MTurk?
•     Selected topic “space program” (011)
•     Modified original 4-page instructions from TREC
•     Workers more accurate than original assessors!
•     40% provided justification for each answer


    O. Alonso and S. Mizzaro. “Can we get rid of TREC assessors? Using Mechanical Turk for relevance assessment”, SIGIR Workshop
    on the Future of IR Evaluation, 2009.



    November 1, 2011                          Crowdsourcing for Research and Engineering                                           60
IR Example – Timeline Annotation
• Workers annotate timeline on politics, sports, culture
• Given a timex (1970s, 1982, etc.), suggest an associated event
• Given an event (Vietnam, World Cup, etc.), suggest a timex




 K. Berberich, S. Bedathur, O. Alonso, G. Weikum “A Language Modeling Approach for Temporal Information Needs”. ECIR 2010




 November 1, 2011                       Crowdsourcing for Research and Engineering                                          61
How can I get started?

• You have an idea
• Easy, cheap, fast, and iterative sounds good


Can you test your idea via crowdsourcing?
• Is my idea crowdsourcable?
• How do I start?
• What do I need?
November 1, 2011       Crowdsourcing for Research and Engineering   62
Tip for Getting Started: do work
Try doing work before you create work for others!




November 1, 2011   Crowdsourcing for Research and Engineering   63
II
              AMAZON MECHANICAL TURK

November 1, 2011   Crowdsourcing for Research and Engineering   64
Mechanical What?




November 1, 2011     Crowdsourcing for Research and Engineering   65
MTurk: The Requester
•   Sign up with your Amazon account
•   Amazon payments
•   Purchase prepaid HITs
•   There is no minimum or up-front fee
•   MTurk collects a 10% commission
•   The minimum commission charge is $0.005 per HIT
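
To see how those fees add up, here is a minimal back-of-the-envelope cost sketch in Python. It assumes, for simplicity, that the 10% commission (with its $0.005 minimum) is charged on every assignment; check the current fee schedule before budgeting a real batch.

def batch_cost(num_hits, assignments_per_hit, reward,
               commission_rate=0.10, min_commission=0.005):
    """Rough cost estimate; assumes fees are charged per assignment."""
    fee = max(reward * commission_rate, min_commission)
    return num_hits * assignments_per_hit * (reward + fee)

# Example: 1,000 HITs, 5 redundant judgments each, $0.02 per judgment
print("Estimated cost: $%.2f" % batch_cost(1000, 5, 0.02))   # -> $125.00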




November 1, 2011       Crowdsourcing for Research and Engineering   66
MTurk Dashboard
• Three tabs
      – Design
      – Publish
      – Manage
• Design
      – HIT Template
• Publish
      – Make work available
• Manage
      – Monitor progress


November 1, 2011       Crowdsourcing for Research and Engineering   67
MTurk: Dashboard - II




November 1, 2011       Crowdsourcing for Research and Engineering   68
MTurk API
•   Amazon Web Services API
•   Rich set of services
•   Command line tools
•   More flexibility than dashboard
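
As a rough illustration (not official sample code), creating a HIT programmatically with the community Boto library (boto 2.x, listed under tools later in this tutorial) looks roughly like the sketch below. The credentials, the external form URL, and the prices/durations are all placeholders, and the keyword arguments should be double-checked against the Boto documentation.

# Minimal sketch assuming the boto 2.x MTurk module and the sandbox endpoint.
from boto.mturk.connection import MTurkConnection
from boto.mturk.question import ExternalQuestion

conn = MTurkConnection(aws_access_key_id="YOUR_KEY",          # placeholder
                       aws_secret_access_key="YOUR_SECRET",   # placeholder
                       host="mechanicalturk.sandbox.amazonaws.com")

# Point the HIT at an externally hosted judging form (hypothetical URL).
question = ExternalQuestion(external_url="https://example.org/judge.html",
                            frame_height=600)

conn.create_hit(question=question,
                title="Judge the relevance of a web page",
                description="Read a short web page and answer one question",
                keywords=["relevance", "judging", "search"],
                reward=0.02,           # dollars per assignment
                max_assignments=5,     # redundant judgments per HIT
                duration=600,          # seconds a worker may hold the HIT
                lifetime=86400)        # seconds the HIT stays available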




November 1, 2011   Crowdsourcing for Research and Engineering   69
MTurk Dashboard vs. API
• Dashboard
      – Easy to prototype
      – Setup and launch an experiment in a few minutes
• API
      – Ability to integrate AMT as part of a system
      – Ideal if you want to run experiments regularly
      – Schedule tasks


November 1, 2011        Crowdsourcing for Research and Engineering   70
Working on MTurk
• Sign up with your Amazon account
• Tabs
      – Account: work approved/rejected
      – HIT: browse and search for work
      – Qualifications: browse & search qualifications
• Start turking!



November 1, 2011     Crowdsourcing for Research and Engineering   71
Why Eytan Adar hates MTurk Research
          (at least sort of)
• Overly-narrow focus on Turk & other platforms
      – Identify general vs. platform-specific problems
      – Academic vs. Industrial problems
• Lack of appreciation of interdisciplinary nature
      – Some problems well-studied in other areas
      – Human behavior hasn’t changed much
• Turkers aren't Martians
      – How many prior user studies do we have to
        reproduce on MTurk before we can get over it?
November 1, 2011     Crowdsourcing for Research and Engineering   72
III
        RELEVANCE JUDGING & CROWDSOURCING

November 1, 2011   Crowdsourcing for Research and Engineering   73
November 1, 2011   Crowdsourcing for Research and Engineering   74
Motivating Example: Relevance Judging

• Relevance of search results is difficult to judge
      – Highly subjective
      – Expensive to measure
• Professional editors commonly used
• Potential benefits of crowdsourcing
      – Scalability (time and cost)
      – Diversity of judgments


November 1, 2011     Crowdsourcing for Research and Engineering   75
November 1, 2011   Crowdsourcing for Research and Engineering   76
Started with a joke …




November 1, 2011   Crowdsourcing for Research and Engineering   77
Results for {idiot} at WSDM 2011
February 2011: 5/7 (R), 2/7 (NR)
    –   Most of the time those TV reality stars have absolutely no talent. They do whatever
        they can to make a quick dollar. Most of the time the reality tv stars don not have
        a mind of their own.   R
    –   Most are just celebrity wannabees. Many have little or no talent, they just want
        fame. R
    –   I can see this one going both ways. A particular sort of reality star comes to
        mind, though, one who was voted off Survivor because he chose not to use his
        immunity necklace. Sometimes the label fits, but sometimes it might be unfair. R
    –   Just because someone else thinks they are an "idiot", doesn't mean that is what the
        word means. I don't like to think that any one person's photo would be used to
        describe a certain term.   NR
    –   While some reality-television stars are genuinely stupid (or cultivate an image of
        stupidity), that does not mean they can or should be classified as "idiots." Some
        simply act that way to increase their TV exposure and potential earnings. Other
        reality-television stars are really intelligent people, and may be considered as
        idiots by people who don't like them or agree with them. It is too subjective an
        issue to be a good result for a search engine. NR
    –   Have you seen the knuckledraggers on reality television? They should be required to
        change their names to idiot after appearing on the show. You could put numbers
        after the word idiot so we can tell them apart. R
    –   Although I have not followed too many of these shows, those that I have encountered
        have for a great part a very common property. That property is that most of the
        participants involved exhibit a shallow self-serving personality that borders on
        social pathological behavior. To perform or act in such an abysmal way could only
        be an act of an idiot. R
 November 1, 2011             Crowdsourcing for Research and Engineering               78
Two Simple Examples of MTurk
1. Ask workers to classify a query
2. Ask workers to judge document relevance

Steps
• Define high-level task
• Design & implement interface & backend
• Launch, monitor progress, and assess work
• Iterate design

November 1, 2011   Crowdsourcing for Research and Engineering   79
Query Classification Task
•   Ask workers to classify a query
•   Show a form that contains a few categories
•   Upload a few queries (~20)
•   Use 3 workers
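
As a small illustration of the data-upload step, the dashboard takes a delimited file whose column names match the variables in your HIT template. A sketch that writes such a file, assuming a template with a ${query} placeholder (the queries themselves are made up):

import csv

# Hypothetical sample; the real run would upload roughly 20 queries.
queries = ["cheap flights to austin", "python csv module", "idiot"]

with open("queries.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["query"])   # header must match the ${query} template variable
    for q in queries:
        writer.writerow([q])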




November 1, 2011         Crowdsourcing for Research and Engineering   80
DEMO


November 1, 2011   Crowdsourcing for Research and Engineering   81
November 1, 2011   Crowdsourcing for Research and Engineering   82
Relevance Judging Task
• Use a few documents from a standard
  collection used for evaluating search engines
• Ask workers to make binary judgments
• Modification: graded judging
• Use 5 workers




November 1, 2011        Crowdsourcing for Research and Engineering   83
DEMO


November 1, 2011   Crowdsourcing for Research and Engineering   84
IV
                   METHODOLOGY FOR EFFECTIVE
                            CROWDSOURCING
November 1, 2011      Crowdsourcing for Research and Engineering   85
November 1, 2011   Crowdsourcing for Research and Engineering   86
Typical Workflow
•   Define and design what to test
•   Sample data
•   Design the experiment
•   Run experiment
•   Collect data and analyze results
•   Quality control



November 1, 2011     Crowdsourcing for Research and Engineering   87
Development Framework
• Incremental approach
• Measure, evaluate, and adjust as you go
• Suitable for repeatable tasks




November 1, 2011        Crowdsourcing for Research and Engineering   88
Survey Design
•   One of the most important parts
•   Part art, part science
•   Instructions are key
•   Prepare to iterate




November 1, 2011   Crowdsourcing for Research and Engineering   89
Questionnaire Design
• Ask the right questions
• Workers may not be IR experts, so don't assume
  they share your terminology
• Show examples
• Hire a technical writer
      – Engineer writes the specification
      – Writer communicates

November 1, 2011       Crowdsourcing for Research and Engineering   90
UX Design
• Time to apply all those usability concepts
• Generic tips
      – The experiment should be self-contained.
      – Keep it short, simple, and concise.
      – Be very clear with the relevance task.
      – Engage with the worker. Avoid boring stuff.
      – Always ask for feedback (open-ended question) in
        an input box.

November 1, 2011    Crowdsourcing for Research and Engineering   91
UX Design - II
•   Presentation
•   Document design
•   Highlight important concepts
•   Colors and fonts
•   Need to grab attention
•   Localization



November 1, 2011   Crowdsourcing for Research and Engineering   92
Example - I
• Asking too much, task not clear, “do NOT/reject”
• Worker has to do a lot of stuff




November 1, 2011   Crowdsourcing for Research and Engineering   93
Example - II
• Lot of work for a few cents
• Go here, go there, copy, enter, count …




November 1, 2011   Crowdsourcing for Research and Engineering   94
A Better Example
• All information is available
      – What to do
      – Search result
      – Question to answer




November 1, 2011        Crowdsourcing for Research and Engineering   95
November 1, 2011   Crowdsourcing for Research and Engineering   96
Form and Metadata
• Form with a closed question (binary relevance) and an
  open-ended question (user feedback)
• Clear title, useful keywords
• Workers need to find your task




November 1, 2011      Crowdsourcing for Research and Engineering   97
Relevance Judging – Example I




November 1, 2011   Crowdsourcing for Research and Engineering   98
Relevance Judging – Example II




November 1, 2011    Crowdsourcing for Research and Engineering   99
Implementation
• Similar to any UX project
• Build a mock up and test it with your team
      – Yes, you need to judge some tasks
• Incorporate feedback and run a test on MTurk
  with a very small data set
      – Time the experiment
      – Do people understand the task?
• Analyze results
      – Look for spammers
      – Check completion times
• Iterate and modify accordingly
November 1, 2011     Crowdsourcing for Research and Engineering   100
Implementation – II
• Introduce quality control
      – Qualification test
      – Gold answers (honey pots)
•   Adjust passing grade and worker approval rate (see the sketch below)
•   Run experiment with new settings & same data
•   Scale on data
•   Scale on workers
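
A sketch of attaching a worker approval-rate requirement with boto 2.x is shown below; the 95% threshold is only an example, and a custom qualification test would be created separately and attached via its qualification type ID.

# Minimal sketch, assuming the boto 2.x MTurk module; credentials are placeholders.
from boto.mturk.connection import MTurkConnection
from boto.mturk.qualification import (Qualifications,
                                      PercentAssignmentsApprovedRequirement)

conn = MTurkConnection(aws_access_key_id="YOUR_KEY",
                       aws_secret_access_key="YOUR_SECRET",
                       host="mechanicalturk.sandbox.amazonaws.com")

# Only workers whose past work was approved at least 95% of the time may accept the HIT.
quals = Qualifications()
quals.add(PercentAssignmentsApprovedRequirement(comparator="GreaterThanOrEqualTo",
                                                integer_value=95))

# quals is then passed along when creating the HIT: create_hit(..., qualifications=quals)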

November 1, 2011      Crowdsourcing for Research and Engineering   101
Experiment in Production
•   Lots of tasks on MTurk at any moment
•   Need to grab attention
•   Importance of experiment metadata
•   When to schedule
      – Split a large task into batches and keep a single
        batch in the system at a time
      – Always review feedback from batch n before
        uploading n+1

November 1, 2011     Crowdsourcing for Research and Engineering   102
How Much to Pay?
• Price commensurate with task effort
      – Ex: $0.02 for yes/no answer + $0.02 bonus for optional feedback
• Ethics & market-factors: W. Mason and S. Suri, 2010.
      – e.g. non-profit SamaSource contracts workers in refugee camps
      – Predict right price given market & task: Wang et al. CSDM’11
• Uptake & time-to-completion vs. Cost & Quality
      – Too little $$, no interest or slow – too much $$, attract spammers
      – Real problem is lack of reliable QA substrate
• Accuracy & quantity
      – More pay = more work, not better (W. Mason and D. Watts, 2009)
• Heuristics: start small, watch uptake and bargaining feedback
• Worker retention (“anchoring”)
See also: L.B. Chilton et al. KDD-HCOMP 2010.
   November 1, 2011                    Crowdsourcing for Research and Engineering   103
November 1, 2011   Crowdsourcing for Research and Engineering   104
Quality Control in General
• Extremely important part of the experiment
• Approach as “overall” quality; not just for workers
• Bi-directional channel
   – You may think the worker is doing a bad job.
   – The same worker may think you are a lousy requester.




 November 1, 2011     Crowdsourcing for Research and Engineering   105
When to assess quality of work
• Beforehand (prior to main task activity)
      – How: “qualification tests” or similar mechanism
      – Purpose: screening, selection, recruiting, training
• During
      – How: assess labels as worker produces them
            • Like random checks on a manufacturing line
      – Purpose: calibrate, reward/penalize, weight
• After
      – How: compute accuracy metrics post-hoc
      – Purpose: filter, calibrate, weight, retain (HR)
      – E.g. Jung & Lease (2011), Tang & Lease (2011), ...
November 1, 2011          Crowdsourcing for Research and Engineering   106
How to assess quality of work?
• Compare worker’s label vs.
      – Known (correct, trusted) label
      – Other workers’ labels
            • P. Ipeirotis. Worker Evaluation in Crowdsourcing: Gold Data or
              Multiple Workers? Sept. 2010.
      – Model predictions of the above
            • Model the labels (Ryu & Lease, ASIS&T11)
            • Model the workers (Chen et al., AAAI’10)
• Verify worker’s label
      – Yourself
      – Tiered approach (e.g. Find-Fix-Verify)
            • Quinn and B. Bederson’09, Bernstein et al.’10
November 1, 2011           Crowdsourcing for Research and Engineering   107
Typical Assumptions
• Objective truth exists
      – no minority voice / rare insights
      – Can relax this to model “truth distribution”
• Automatic answer comparison/evaluation
      – What about free text responses? Hope from NLP…
            • Automatic essay scoring
            • Translation (BLEU: Papineni, ACL’2002)
            • Summarization (Rouge: C.Y. Lin, WAS’2004)
      – Have people do it (yourself or find-verify crowd, etc.)
November 1, 2011         Crowdsourcing for Research and Engineering   108
Distinguishing Bias vs. Noise
• Ipeirotis (HComp 2010)
• People often have consistent, idiosyncratic
  skews in their labels (bias)
      – E.g. I like action movies, so they get higher ratings
• Once detected, systematic bias can be
  calibrated for and corrected (yeah!)
• Noise, however, seems random & inconsistent
      – this is the real issue we want to focus on

November 1, 2011     Crowdsourcing for Research and Engineering   109
Comparing to known answers
• AKA: gold, honey pot, verifiable answer, trap
• Assumes you have known answers
• Cost vs. Benefit
      – Producing known answers (experts?)
      – % of work spent re-producing them
• Finer points
      – Controls against collusion
      – What if workers recognize the honey pots?
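
A minimal sketch of the bookkeeping: score each worker only on items that have a trusted label, then flag workers whose accuracy on those honey pots falls below a threshold (the data and the 0.7 cutoff below are made up).

from collections import defaultdict

gold = {"doc1": "R", "doc2": "NR", "doc3": "R"}          # trusted labels (toy data)
labels = [("w1", "doc1", "R"), ("w1", "doc2", "NR"), ("w1", "doc3", "NR"),
          ("w2", "doc1", "R"), ("w2", "doc2", "NR"), ("w2", "doc3", "R")]

correct = defaultdict(int)   # correct answers on gold items, per worker
seen = defaultdict(int)      # gold items attempted, per worker
for worker, item, label in labels:
    if item in gold:
        seen[worker] += 1
        correct[worker] += (label == gold[item])

for worker in sorted(seen):
    accuracy = correct[worker] / seen[worker]
    print(worker, "accuracy on gold = %.2f" % accuracy,
          "(flag for review)" if accuracy < 0.7 else "")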

November 1, 2011    Crowdsourcing for Research and Engineering   110
Comparing to other workers
•   AKA: consensus, plurality, redundant labeling
•   Well-known metrics for measuring agreement
•   Cost vs. Benefit: % of work that is redundant
•   Finer points
      – Is consensus “truth” or systematic bias of group?
      – What if no one really knows what they’re doing?
            • Low-agreement across workers indicates problem is with the
              task (or a specific example), not the workers
      – Risk of collusion
• Sheng et al. (KDD 2008)
November 1, 2011          Crowdsourcing for Research and Engineering   111
Comparing to predicted label
• Ryu & Lease, ASIS&T11 (CrowdConf’11 poster)
• Catch-22 extremes
      – If model is really bad, why bother comparing?
      – If model is really good, why collect human labels?
• Exploit model confidence
      – Trust predictions proportional to confidence
      – What if model very confident and wrong?
• Active learning
      – Time sensitive: Accuracy / confidence changes
November 1, 2011     Crowdsourcing for Research and Engineering   112
Compare to predicted worker labels
• Chen et al., AAAI’10
• Avoid inefficiency of redundant labeling
      – See also: Dekel & Shamir (COLT’2009)
• Train a classifier for each worker
• For each example labeled by a worker
      – Compare to predicted labels for all other workers
• Issues
     • Sparsity: workers have to stick around to train model…
     • Time-sensitivity: New workers & incremental updates?

November 1, 2011      Crowdsourcing for Research and Engineering   113
Methods for measuring agreement
• What to look for
      – Agreement, reliability, validity
• Inter-agreement level
      – Agreement between judges
      – Agreement between judges and the gold set
• Some statistics
      –   Percentage agreement
      –   Cohen’s kappa (2 raters)
      –   Fleiss’ kappa (any number of raters)
      –   Krippendorff’s alpha
• With majority vote, what if 2 say relevant, 3 say not?
      – Use expert to break ties (Kochhar et al, HCOMP’10; GQR)
      – Collect more judgments as needed to reduce uncertainty
November 1, 2011         Crowdsourcing for Research and Engineering   114
Inter-rater reliability
• Lots of research
• Statistics books cover most of the material
• Three categories based on the goals
      – Consensus estimates
      – Consistency estimates
      – Measurement estimates




November 1, 2011       Crowdsourcing for Research and Engineering   115
Sample code
      – R packages psy and irr
      >library(psy)
      >library(irr)
      >my_data <- read.delim(file="test.txt",
        header=TRUE, sep="\t")   # tab-delimited file, one column per rater
      >kappam.fleiss(my_data, exact=FALSE)   # Fleiss' kappa (2+ raters, from irr)

      >my_data2 <- read.delim(file="test2.txt",
        header=TRUE, sep="\t")
      >ckappa(my_data2)                      # Cohen's kappa (2 raters, from psy)




November 1, 2011    Crowdsourcing for Research and Engineering   116
Kappa coefficient
• Different interpretations of kappa values
• For practical purposes you want at least moderate agreement
• Results may vary
           k                               Interpretation
           <0                              Poor agreement
           0.01 – 0.20                     Slight agreement
           0.21 – 0.40                     Fair agreement
           0.41 – 0.60                     Moderate agreement
           0.61 – 0.80                     Substantial agreement
           0.81 – 1.00                     Almost perfect agreement


November 1, 2011         Crowdsourcing for Research and Engineering   117
Detection Theory
• Sensitivity measures
      – High sensitivity: good ability to discriminate
      – Low sensitivity: poor ability
                    Stimulus         “Yes”                  “No”
                    Class
                    S2               Hits                   Misses
                    S1               False alarms           Correct
                                                            rejections

            Hit rate H = P(“yes”|S2)
            False alarm rate F = P(“yes”|S1)
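
With those definitions, the two rates (and, if desired, a sensitivity index such as d') follow directly from the four counts. A small sketch with made-up counts; scipy is used only for the z-transform.

from scipy.stats import norm

hits, misses = 40, 10                        # made-up responses to S2
false_alarms, correct_rejections = 15, 35    # made-up responses to S1

H = hits / (hits + misses)                               # hit rate, P("yes"|S2)
F = false_alarms / (false_alarms + correct_rejections)   # false-alarm rate, P("yes"|S1)
d_prime = norm.ppf(H) - norm.ppf(F)                      # standard sensitivity index

print("H = %.2f  F = %.2f  d' = %.2f" % (H, F, d_prime))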


November 1, 2011              Crowdsourcing for Research and Engineering   118
November 1, 2011   Crowdsourcing for Research and Engineering   119
Finding Consensus
• When multiple workers disagree on the
  correct label, how do we resolve this?
      – Simple majority vote (or average and round)
      – Weighted majority vote (e.g. Naive Bayes); see the sketch below
• Many papers from machine learning…
• If wide disagreement, likely there is a bigger
  problem which consensus doesn’t address
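
A minimal sketch of both schemes on a toy set of five judgments; the per-worker accuracy estimates used as weights are made up, and a Naive-Bayes-style weighting would use log-odds of those accuracies instead.

from collections import Counter, defaultdict

labels = [("w1", "R"), ("w2", "R"), ("w3", "NR"), ("w4", "NR"), ("w5", "NR")]
accuracy = {"w1": 0.9, "w2": 0.8, "w3": 0.6, "w4": 0.55, "w5": 0.5}   # toy estimates

# Simple majority vote: every worker counts equally.
majority = Counter(label for _, label in labels).most_common(1)[0][0]

# Weighted vote: each vote counts in proportion to the worker's estimated accuracy.
weights = defaultdict(float)
for worker, label in labels:
    weights[label] += accuracy.get(worker, 0.5)
weighted = max(weights, key=weights.get)

print("majority:", majority, "| weighted:", weighted)   # the two can disagree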


November 1, 2011     Crowdsourcing for Research and Engineering   120
Quality Control on MTurk
• Rejecting work & Blocking workers (more later…)
    – Requestors don’t want bad PR or complaint emails
    – Common practice: always pay, block as needed
• Approval rate: easy to use, but value?
    – P. Ipeirotis. Be a Top Mechanical Turk Worker: You Need $5
      and 5 Minutes. Oct. 2010
    – Many requestors don’t ever reject…
• Qualification test
    – Pre-screen workers’ capabilities & effectiveness
    – Example and pros/cons in next slides…
• Geographic restrictions
• Mechanical Turk Masters (June 23, 2011)
    – Recent addition, degree of benefit TBD…
  November 1, 2011         Crowdsourcing for Research and Engineering   121
Tools and Packages for MTurk
• QA infrastructure layers atop MTurk promote
  useful separation-of-concerns from task
      – TurkIt
            • Quik Turkit provides nearly realtime services
      –   Turkit-online (??)
      –   Get Another Label (& qmturk)
      –   Turk Surveyor
      –   cv-web-annotation-toolkit (image labeling)
      –   Soylent
      –   Boto (python library)
            • Turkpipe: submit batches of jobs using the command line.
• More needed…
November 1, 2011           Crowdsourcing for Research and Engineering    122
A qualification test snippet
<Question>
  <QuestionIdentifier>question1</QuestionIdentifier>
  <QuestionContent>
    <Text>Carbon monoxide poisoning is</Text>
  </QuestionContent>
  <AnswerSpecification>
    <SelectionAnswer>
       <StyleSuggestion>radiobutton</StyleSuggestion>
         <Selections>
           <Selection>
              <SelectionIdentifier>1</SelectionIdentifier>
              <Text>A chemical technique</Text>
           </Selection>
           <Selection>
              <SelectionIdentifier>2</SelectionIdentifier>
              <Text>A green energy treatment</Text>
           </Selection>
           <Selection>
               <SelectionIdentifier>3</SelectionIdentifier>
               <Text>A phenomena associated with sports</Text>
           </Selection>
           <Selection>
               <SelectionIdentifier>4</SelectionIdentifier>
               <Text>None of the above</Text>
           </Selection>
         </Selections>
    </SelectionAnswer>
  </AnswerSpecification>
</Question>
  November 1, 2011                  Crowdsourcing for Research and Engineering   123
Qualification tests: pros and cons
• Advantages
      – Great tool for controlling quality
      – Adjust passing grade
• Disadvantages
      –   Extra cost to design and implement the test
      –   May turn off workers, hurt completion time
      –   Refresh the test on a regular basis
      –   Hard to verify subjective tasks like judging relevance
• Try creating task-related questions to get worker
  familiar with task before starting task in earnest
November 1, 2011        Crowdsourcing for Research and Engineering   124
More on quality control & assurance
• HR issues: recruiting, selection, & retention
      – e.g., post/tweet, design a better qualification test,
        bonuses, …
• Collect more redundant judgments…
      – at some point defeats cost savings of
        crowdsourcing
      – 5 workers is often sufficient



November 1, 2011     Crowdsourcing for Research and Engineering   125
Robots and Captchas
• Some reports of robots on MTurk
      – E.g. McCreadie et al. (2011)
      – violation of terms of service
      – Artificial artificial artificial intelligence
• Captchas seem ideal, but…
      – Robots have abused MTurk by having turkers solve captchas so
        the robots can access web resources
      – Turker wisdom is therefore to avoid such HITs
• What to do?
      –   Use standard captchas, notify workers
      –   Block robots in other ways (e.g. external HITs)
      –   Catch robots through standard QC, response times
      –   Use HIT-specific captchas (Kazai et al., 2011)
November 1, 2011          Crowdsourcing for Research and Engineering   126
Was the task difficult?
• Ask workers to rate difficulty of a search topic
• 50 topics; 5 workers, $0.01 per task




November 1, 2011        Crowdsourcing for Research and Engineering   127
Other quality heuristics
• Justification/feedback as quasi-captcha
      – Successfully proven in past experiments
      – Should be optional
      – Automatically verifying feedback was written by a
        person may be difficult (classic spam detection task)
• Broken URL/incorrect object
      – Leave an outlier in the data set
      – Workers will tell you
      – If somebody answers “excellent” on a graded
        relevance test for a broken URL => probably spammer

November 1, 2011        Crowdsourcing for Research and Engineering   128
Dealing with bad workers
• Pay for “bad” work instead of rejecting it?
   – Pro: preserve reputation, admit if poor design at fault
   – Con: promote fraud, undermine approval rating system
• Use bonus as incentive
   – Pay the minimum $0.01 and $0.01 for bonus
   – Better than rejecting a $0.02 task
• If spammer “caught”, block from future tasks
   – May be easier to always pay, then block as needed
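
In boto 2.x terms, the always-pay-then-block policy is roughly two calls; a sketch, with placeholder IDs and credentials.

from boto.mturk.connection import MTurkConnection

conn = MTurkConnection(aws_access_key_id="YOUR_KEY",
                       aws_secret_access_key="YOUR_SECRET",
                       host="mechanicalturk.sandbox.amazonaws.com")

# Pay even suspect work to protect your requester reputation ...
conn.approve_assignment("ASSIGNMENT_ID", feedback="Thanks for your work.")

# ... but keep identified spammers out of future batches.
conn.block_worker("WORKER_ID", reason="Failed repeated gold checks.")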

 November 1, 2011         Crowdsourcing for Research and Engineering   129
Worker feedback
• Real feedback received via email after rejection
• Worker XXX
    I did. If you read these articles most of them have
    nothing to do with space programs. I’m not an idiot.

• Worker XXX
    As far as I remember there wasn't an explanation about
    what to do when there is no name in the text. I believe I
    did write a few comments on that, too. So I think you're
    being unfair rejecting my HITs.




November 1, 2011     Crowdsourcing for Research and Engineering   130
Real email exchange with worker after rejection
WORKER: this is not fair , you made me work for 10 cents and i lost my 30 minutes
of time ,power and lot more and gave me 2 rejections at least you may keep it
pending. please show some respect to turkers

REQUESTER: I'm sorry about the rejection. However, in the directions given in the
hit, we have the following instructions: IN ORDER TO GET PAID, you must judge all 5
webpages below *AND* complete a minimum of three HITs.

Unfortunately, because you only completed two hits, we had to reject those hits.
We do this because we need a certain amount of data on which to make decisions
about judgment quality. I'm sorry if this caused any distress. Feel free to contact me
if you have any additional questions or concerns.

WORKER: I understood the problems. At that time my kid was crying and i went to
look after. that's why i responded like that. I was very much worried about a hit
being rejected. The real fact is that i haven't seen that instructions of 5 web page
and started doing as i do the dolores labs hit, then someone called me and i went
to attend that call. sorry for that and thanks for your kind concern.
  November 1, 2011           Crowdsourcing for Research and Engineering         131
Exchange with worker
•   Worker XXX
    Thank you. I will post positive feedback for you at
    Turker Nation.

Me: was this a sarcastic comment?

•   I took a chance by accepting some of your HITs to see if
    you were a trustworthy author. My experience with you
    has been favorable so I will put in a good word for you
    on that website. This will help you get higher quality
    applicants in the future, which will provide higher
    quality work, which might be worth more to you, which
    hopefully means higher HIT amounts in the future.




November 1, 2011        Crowdsourcing for Research and Engineering   132
Build Your Reputation as a Requestor
• Word of mouth effect
      – Workers trust the requester (pay on time, clear
        explanation if there is a rejection)
      – Experiments tend to go faster
      – Announce forthcoming tasks (e.g. tweet)
• Disclose your real identity?



November 1, 2011    Crowdsourcing for Research and Engineering   133
Other practical tips
• Sign up as worker and do some HITs
• “Eat your own dog food”
• Monitor discussion forums
• Address feedback (e.g., poor guidelines,
  payments, passing grade, etc.)
• Everything counts!
      – Overall design only as strong as weakest link


November 1, 2011      Crowdsourcing for Research and Engineering   134
Content quality
• People like to work on things that they like
• TREC ad-hoc vs. INEX
      – TREC experiments took twice as long to complete
      – INEX (Wikipedia), TREC (LA Times, FBIS)
• Topics
      – INEX: Olympic games, movies, salad recipes, etc.
      – TREC: cosmic events, Schengen agreement, etc.
• Content and judgments according to modern times
      – Airport security docs are pre 9/11
      – Antarctic exploration (global warming)

November 1, 2011       Crowdsourcing for Research and Engineering   135
Content quality - II
• Document length
• Randomize content
• Avoid worker fatigue
      – Judging 100 documents on the same subject can
        be tiring, leading to decreasing quality




November 1, 2011      Crowdsourcing for Research and Engineering   136
Presentation
• People scan documents for relevance cues
• Document design
• Highlighting (no more than 10% of the text)




November 1, 2011   Crowdsourcing for Research and Engineering   137
Presentation - II




November 1, 2011    Crowdsourcing for Research and Engineering   138
Relevance justification
• Why settle for a label?
• Let workers justify answers
      – cf. Zaidan et al. (2007) “annotator rationales”
• INEX
      – 22% of assignments with comments
• Must be optional
• Let’s see how people justify

November 1, 2011        Crowdsourcing for Research and Engineering   139
“Relevant” answers
 [Salad Recipes]
 Doesn't mention the word 'salad', but the recipe is one that could be considered a
    salad, or a salad topping, or a sandwich spread.
 Egg salad recipe
 Egg salad recipe is discussed.
 History of salad cream is discussed.
 Includes salad recipe
 It has information about salad recipes.
 Potato Salad
 Potato salad recipes are listed.
 Recipe for a salad dressing.
 Salad Recipes are discussed.
 Salad cream is discussed.
 Salad info and recipe
 The article contains a salad recipe.
 The article discusses methods of making potato salad.
 The recipe is for a dressing for a salad, so the information is somewhat narrow for
    the topic but is still potentially relevant for a researcher.
 This article describes a specific salad. Although it does not list a specific recipe,
    it does contain information relevant to the search topic.
 gives a recipe for tuna salad
 relevant for tuna salad recipes
 relevant to salad recipes
 this is on-topic for salad recipes




November 1, 2011            Crowdsourcing for Research and Engineering             140
“Not relevant” answers
[Salad Recipes]
About gaming not salad recipes.
Article is about Norway.
Article is about Region Codes.
Article is about forests.
Article is about geography.
Document is about forest and trees.
Has nothing to do with salad or recipes.
Not a salad recipe
Not about recipes
Not about salad recipes
There is no recipe, just a comment on how salads fit into meal formats.
There is nothing mentioned about salads.
While dressings should be mentioned with salads, this is an article on one specific
    type of dressing, no recipe for salads.
article about a swiss tv show
completely off-topic for salad recipes
not a salad recipe
not about salad recipes
totally off base



November 1, 2011            Crowdsourcing for Research and Engineering                141
November 1, 2011   Crowdsourcing for Research and Engineering   142
Feedback length

• Workers will justify answers
• Has to be optional for good
  feedback
• In E51, mandatory comments
  – Length dropped
  – “Relevant” or “Not Relevant”



  November 1, 2011    Crowdsourcing for Research and Engineering   143
Other design principles
• Text alignment
• Legibility
• Reading level: complexity of words and sentences
• Attractiveness (worker’s attention & enjoyment)
• Multi-cultural / multi-lingual
• Who is the audience (e.g. target worker community)
      – Special needs communities (e.g. simple color blindness)
• Parsimony
• Cognitive load: mental rigor needed to perform task
• Exposure effect
November 1, 2011        Crowdsourcing for Research and Engineering   144
Platform alternatives
• Why MTurk
      – Amazon brand, lots of research papers
      – Speed, price, diversity, payments
• Why not
      – Crowdsourcing != MTurk
      – Spam, no analytics, must build tools for worker & task quality
• How to build your own crowdsourcing platform
      –   Back-end
      –   Template language for creating experiments
      –   Scheduler
      –   Payments?


November 1, 2011        Crowdsourcing for Research and Engineering   145
The human side
• As a worker
    –   I hate when instructions are not clear
    –   I’m not a spammer – I just don’t get what you want
    –   Boring task
    –   Good pay is important but not the only condition for engagement
• As a requester
    – Attrition
    – Balancing act: a task that would produce the right results and
      is appealing to workers
    – I want your honest answer for the task
    – I want qualified workers; system should do some of that for me
• Managing crowds and tasks is a daily activity
    – more difficult than managing computers
 November 1, 2011       Crowdsourcing for Research and Engineering   146
Things that work
•   Qualification tests
•   Honey-pots
•   Good content and good presentation
•   Economy of attention
•   Things to improve
      – Manage workers at different levels of expertise,
        including spammers and borderline cases.
      – Mix pools of workers with different profiles and
        expertise levels.

November 1, 2011     Crowdsourcing for Research and Engineering   147
Things that need work
• UX and guidelines
      – Help the worker
      – Cost of interaction
•   Scheduling and refresh rate
•   Exposure effect
•   Sometimes we just don’t agree
•   How crowdsourcable is your task

November 1, 2011       Crowdsourcing for Research and Engineering   148
V.
                   FUTURE TRENDS: FROM LABELING TO
                              HUMAN COMPUTATION
November 1, 2011         Crowdsourcing for Research and Engineering   149
The Turing Test (Alan Turing, 1950)




November 1, 2011   Crowdsourcing for Research and Engineering   150
November 1, 2011   Crowdsourcing for Research and Engineering   151
The Turing Test (Alan Turing, 1950)




November 1, 2011   Crowdsourcing for Research and Engineering   152
What is a Computer?




November 1, 2011      Crowdsourcing for Research and Engineering   153
• What was old is new
• “Crowdsourcing: A New
  Branch of Computer Science”
  (March 29, 2011)

• See also: M. Croarken
  (2003), Tabulating the
  heavens: computing the
  Nautical Almanac in
  18th-century England
                                                 Princeton University Press, 2005
  November 1, 2011   Crowdsourcing for Research and Engineering           154
Davis et al. (2010) The HPU.








November 1, 2011   Crowdsourcing for Research and Engineering   155
Remembering the Human in HPU
• Not just turning a mechanical crank




November 1, 2011   Crowdsourcing for Research and Engineering   156
Human Computation
Rebirth of people as ‘computists’; people do tasks computers cannot (do well)
Stage 1: Detecting robots
     – CAPTCHA: Completely Automated Public Turing test to tell Computers and Humans Apart
     – No useful work produced; people just answer questions with known answers

Stage 2: Labeling data (at scale)
     – E.g. ESP game, typical use of MTurk
     – Game changer for AI: starving for data

Stage 3: General “human computation” (HPU)
     – people do arbitrarily sophisticated tasks (i.e. compute arbitrary functions)
     – HPU as core component in system architecture, many “HPC” invocations
     – blend HPU with automation for a new class of hybrid applications
     – New tradeoffs possible in latency/cost vs. functionality/accuracy

  November 1, 2011             Crowdsourcing for Research and Engineering               157
Mobile Phone App: “Amazon Remembers”




November 1, 2011   Crowdsourcing for Research and Engineering   158
Soylent: A Word Processor with a Crowd Inside

 • Bernstein et al., UIST 2010




 November 1, 2011   Crowdsourcing for Research and Engineering   159
CrowdSearch and mCrowd
• T. Yan, MobiSys 2010




November 1, 2011   Crowdsourcing for Research and Engineering   160
Translation by monolingual speakers
• C. Hu, CHI 2009




November 1, 2011    Crowdsourcing for Research and Engineering   161
Wisdom of Crowds (WoC)
Requires
• Diversity
• Independence
• Decentralization
• Aggregation

Input: large, diverse sample
     (to increase likelihood of overall pool quality)
Output: consensus or selection (aggregation)
November 1, 2011    Crowdsourcing for Research and Engineering   162
WoC vs. Ensemble Learning
• Combine multiple models to improve performance
  over any constituent model
   – Can use many weak learners to make a strong one
   – Compensate for poor models with extra computation
• Works better with diverse, independent learners
• cf. NIPS 2010-2011 Workshops
   – Computational Social Science & the Wisdom of Crowds
• More investigation needed of traditional feature-
  based machine learning & ensemble methods for
  consensus labeling with crowdsourcing
  November 1, 2011   Crowdsourcing for Research and Engineering   163
Unreasonable Effectiveness of Data
• Massive free Web data
  changed how we train
  learning systems
  – Banko and Brill (2001).
    Human Language Tech.
  – Halevy et al. (2009). IEEE
    Intelligent Systems.

 • How might access to cheap & plentiful labeled
   data change the balance again?
  November 1, 2011   Crowdsourcing for Research and Engineering   164
Active Learning
• Minimize number of labels to achieve goal
  accuracy rate of classifier
      – Select examples to label to maximize learning
• Vijayanarasimhan and Grauman (CVPR 2011)
      – Simple margin criterion: select maximally uncertain
        examples to label next (see the sketch below)

      – Finding which examples are uncertain can be
        computationally intensive (workers have to wait)
      – Use locality-sensitive hashing to find uncertain
        examples in sub-linear time
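
A sketch of the margin criterion itself, leaving out the hashing speed-up: rank unlabeled examples by the gap between their two most probable classes under the current model and send the smallest-gap examples to the crowd. The probability matrix below is made up.

import numpy as np

# Toy predicted class probabilities for five unlabeled examples (rows).
proba = np.array([[0.95, 0.05],
                  [0.55, 0.45],
                  [0.70, 0.30],
                  [0.51, 0.49],
                  [0.85, 0.15]])

sorted_p = np.sort(proba, axis=1)
margin = sorted_p[:, -1] - sorted_p[:, -2]   # small margin = uncertain example

k = 2
to_label_next = np.argsort(margin)[:k]       # indices of the k most uncertain examples
print("send to crowd:", to_label_next)       # -> [3 1]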
November 1, 2011     Crowdsourcing for Research and Engineering   165
Active Learning (2)
• V&G report each learning iteration ~ 75 min
      – 15 minutes for model training & selection
      – 60 minutes waiting for crowd labels
• Leaving workers idle may lose them, slowing
  uptake and completion times
• Keep workers occupied
      – Mason and Suri (2010): paid waiting room
      – Laws et al. (EMNLP 2011): parallelize labeling and
        example selection via producer-consumer model
            • Workers consume examples, produce labels
            • Model consumes label, produces examples
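
A toy sketch of that producer-consumer structure using Python's standard queue module: one thread stands in for the crowd (consuming examples, producing labels) while the main loop stands in for the model (consuming labels, producing new examples to label). The labeling and selection logic here are placeholders.

import queue, random, threading

examples, labels = queue.Queue(), queue.Queue()

def crowd_worker():
    """Stand-in for the crowd: consume examples, produce (example, label) pairs."""
    while True:
        x = examples.get()
        if x is None:                  # sentinel: no more work
            break
        labels.put((x, random.choice(["R", "NR"])))   # placeholder "judgment"

threading.Thread(target=crowd_worker, daemon=True).start()

pool = list(range(10))                 # unlabeled pool (toy data)
for x in pool[:3]:                     # seed the queue so workers are never idle
    examples.put(x)

collected = []
while len(collected) < len(pool):
    collected.append(labels.get())     # "model" consumes a label ...
    nxt = len(collected) + 2           # ... and selects which example to label next
    if nxt < len(pool):
        examples.put(pool[nxt])
examples.put(None)                     # shut the worker down
print(collected)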
November 1, 2011        Crowdsourcing for Research and Engineering   166
MapReduce with human computation
• Commonalities
      – Large task divided into smaller sub-problems
      – Work distributed among worker nodes (workers)
      – Collect all answers and combine them
      – Varying performance of heterogeneous
        CPUs/HPUs
• Variations
      – Human response latency / size of “cluster”
      – Some tasks are not suitable
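The analogy can be sketched in a few lines of Python; post_hit is a hypothetical placeholder for a real platform call, not an actual API. The “map” step posts one micro-task per chunk of input, and the “reduce” step combines the answers.

```python
# A toy "MapReduce over people": split the work, farm out small units,
# then combine the answers. post_hit() is a placeholder for a real platform call.

def post_hit(chunk):
    """Pretend to post one micro-task and wait for a worker's answer."""
    return {"chunk": chunk, "summary": chunk[:20]}     # placeholder answer

def human_map(chunks):
    # In practice these run in parallel across many workers (the "cluster"),
    # with highly variable latency per HPU.
    return [post_hit(chunk) for chunk in chunks]

def human_reduce(answers):
    # Combine partial answers; in a real workflow the reducer could itself be a HIT.
    return " / ".join(answer["summary"] for answer in answers)

paragraphs = ["First paragraph of a long document.", "Second paragraph of the document."]
print(human_reduce(human_map(paragraphs)))
```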

November 1, 2011    Crowdsourcing for Research and Engineering   167
A Few Questions
• How should we balance automation vs.
  human computation? Which does what?

• Who’s the right person for the job?

• How do we handle complex tasks? Can we
  decompose them into smaller tasks? How?


November 1, 2011    Crowdsourcing for Research and Engineering   168
Research problems – operational
• Methodology
      – Budget, people, documents, queries, presentation,
        incentives, etc.
      – Scheduling
      – Quality
• What’s the best “mix” of HC for a task?
• Which tasks are suitable for HC?
• Can I crowdsource my task?
      – Eickhoff and de Vries, WSDM 2011 CSDM Workshop

November 1, 2011    Crowdsourcing for Research and Engineering   169
More problems
• Human factors vs. outcomes
• Editors vs. workers
• Pricing tasks
• Predicting worker quality from observable
  properties (e.g. task completion time)
• HIT / Requestor ranking or recommendation
• Expert search: who are the right workers, given the
  task's nature and constraints?
• Ensemble methods for Crowd Wisdom consensus
November 1, 2011    Crowdsourcing for Research and Engineering   170
Problems: crowds, clouds and algorithms
• Infrastructure
     – Current platforms are very rudimentary
     – No tools for data analysis
• Dealing with uncertainty (propagate rather than mask)
     –   Temporal and labeling uncertainty
     –   Learning algorithms
     –   Search evaluation
     –   Active learning (which example is likely to be labeled correctly)
• Combining CPU + HPU
     – Human Remote Call?
     – Procedural vs. declarative?
     – Integration points with enterprise systems
 November 1, 2011          Crowdsourcing for Research and Engineering        171
CrowdForge: MapReduce for
     Automation + Human Computation




       Kittur et al., CHI 2011


November 1, 2011         Crowdsourcing for Research and Engineering   172
Conclusions
•   Crowdsourcing works and is here to stay
•   Fast turnaround, easy to experiment, cheap
•   Still have to design the experiments carefully!
•   Usability considerations
•   Worker quality
•   User feedback extremely useful



November 1, 2011   Crowdsourcing for Research and Engineering   173
Conclusions - II
• Lots of opportunities to improve current platforms
• Integration with current systems
• While MTurk was first to market in the micro-task vertical,
  many other vendors are emerging with different
  affordances or value-added features

• Many open research problems …



November 1, 2011    Crowdsourcing for Research and Engineering   174
Conclusions – III
• Important to know your limitations and be
  ready to collaborate
• Lots of different skills and expertise required
      – Social/behavioral science
      – Human factors
      – Algorithms
      – Economics
      – Distributed systems
      – Statistics

November 1, 2011    Crowdsourcing for Research and Engineering   175
VIII
                                 REFERENCES & RESOURCES
November 1, 2011   Crowdsourcing for Research and Engineering   176
Books
• Omar Alonso, Gabriella Kazai, and Stefano
  Mizzaro. (2012). Crowdsourcing for Search
  Engine Evaluation: Why and How.

• Law and von Ahn (2011).
  Human Computation




  November 1, 2011   Crowdsourcing for Research and Engineering   177
More Books
                   July 2010, kindle-only: “This book introduces you to the
                   top crowdsourcing sites and outlines step by step with
                   photos the exact process to get started as a requester on
                    Amazon Mechanical Turk.”




November 1, 2011             Crowdsourcing for Research and Engineering   178
2011 Tutorials and Keynotes
•   By Omar Alonso and/or Matthew Lease
     –   CLEF: Crowdsourcing for Information Retrieval Experimentation and Evaluation (Sep. 20, Omar only)
     –   CrowdConf (Nov. 1, this is it!)
     –   IJCNLP: Crowd Computing: Opportunities and Challenges (Nov. 10, Matt only)
     –   WSDM: Crowdsourcing 101: Putting the WSDM of Crowds to Work for You (Feb. 9)
     –   SIGIR: Crowdsourcing for Information Retrieval: Principles, Methods, and Applications (July 24)

•   AAAI: Human Computation: Core Research Questions and State of the Art
     –   Edith Law and Luis von Ahn, August 7
•   ASIS&T: How to Identify Ducks In Flight: A Crowdsourcing Approach to Biodiversity Research and
    Conservation
     –   Steve Kelling, October 10, ebird
•   EC: Conducting Behavioral Research Using Amazon's Mechanical Turk
     –   Winter Mason and Siddharth Suri, June 5
•   HCIC: Quality Crowdsourcing for Human Computer Interaction Research
      –   Ed Chi, June 14-18 (about HCIC)
     –   Also see his: Crowdsourcing for HCI Research with Amazon Mechanical Turk
•   Multimedia: Frontiers in Multimedia Search
     –   Alan Hanjalic and Martha Larson, Nov 28
•   VLDB: Crowdsourcing Applications and Platforms
      –   Anhai Doan, Michael Franklin, Donald Kossmann, and Tim Kraska
•   WWW: Managing Crowdsourced Human Computation
     –   Panos Ipeirotis and Praveen Paritosh
    November 1, 2011                    Crowdsourcing for Research and Engineering                           179
2011 Workshops & Conferences
•   AAAI-HCOMP: 3rd Human Computation Workshop (Aug. 8)
•   ACIS: Crowdsourcing, Value Co-Creation, & Digital Economy Innovation (Nov. 30 – Dec. 2)
•   Crowdsourcing Technologies for Language and Cognition Studies (July 27)
•   CHI-CHC: Crowdsourcing and Human Computation (May 8)
•   CIKM: BooksOnline (Oct. 24, “crowdsourcing … online books”)
•   CrowdConf 2011 -- 2nd Conf. on the Future of Distributed Work (Nov. 1-2)
•   Crowdsourcing: Improving … Scientific Data Through Social Networking (June 13)
•   EC: Workshop on Social Computing and User Generated Content (June 5)
•   ICWE: 2nd International Workshop on Enterprise Crowdsourcing (June 20)
•   Interspeech: Crowdsourcing for speech processing (August)
•   NIPS: Second Workshop on Computational Social Science and the Wisdom of Crowds (Dec. TBD)
•   SIGIR-CIR: Workshop on Crowdsourcing for Information Retrieval (July 28)
•   TREC-Crowd: Year 1 of TREC Crowdsourcing Track (Nov. 16-18)
•   UbiComp: 2nd Workshop on Ubiquitous Crowdsourcing (Sep. 18)
•   WSDM-CSDM: Crowdsourcing for Search and Data Mining (Feb. 9)
    November 1, 2011             Crowdsourcing for Research and Engineering                   180
Things to Come in 2012
• AAAI Symposium: Wisdom of the Crowd
   – March 26-28
• Year 2 of TREC Crowdsourcing Track
• Human Computation workshop/conference (TBD)
• Journal Special Issues
   – Springer’s Information Retrieval:
     Crowdsourcing for Information Retrieval
   – Hindawi’s Advances in Multimedia Journal:
     Multimedia Semantics Analysis via Crowdsourcing Geocontext
   – IEEE Internet Computing: Crowdsourcing (Sept./Oct. 2012)
   – IEEE Transactions on Multimedia:
     Crowdsourcing in Multimedia (proposal in review)
  November 1, 2011        Crowdsourcing for Research and Engineering   181
Thank You!
Crowdsourcing news & information:
  ir.ischool.utexas.edu/crowd

For further questions, contact us at:
  omar.alonso@microsoft.com
  ml@ischool.utexas.edu




Cartoons by Mateo Burtch (buta@sonic.net)
November 1, 2011   Crowdsourcing for Research and Engineering   182
Recent Overview Papers
• Alex Quinn and Ben Bederson. Human
  Computation: A Survey and Taxonomy of a
  Growing Field. In Proceedings of CHI 2011.
• Man-Ching Yuen, Irwin King, and Kwong-Sak
  Leung. A Survey of Crowdsourcing Systems.
  SocialCom 2011.
• A. Doan, R. Ramakrishnan, A. Halevy.
  Crowdsourcing Systems on the World-Wide
  Web. Communications of the ACM, 2011.

November 1, 2011        Crowdsourcing for Research and Engineering   183
Resources
A Few Blogs
 Behind Enemy Lines (P.G. Ipeirotis, NYU)
 Deneme: a Mechanical Turk experiments blog (Greg Little, MIT)
 CrowdFlower Blog
 http://experimentalturk.wordpress.com
 Jeff Howe

A Few Sites
 The Crowdsortium
 Crowdsourcing.org
 CrowdsourceBase (for workers)
 Daily Crowdsource

MTurk Forums and Resources
 Turker Nation: http://turkers.proboards.com
 http://www.turkalert.com (and its blog)
 Turkopticon: report/avoid shady requestors
 Amazon Forum for MTurk
November 1, 2011             Crowdsourcing for Research and Engineering   184
Bibliography
   J. Barr and L. Cabrera. “AI gets a Brain”, ACM Queue, May 2006.
   Bernstein, M. et al. Soylent: A Word Processor with a Crowd Inside. UIST 2010. Best Student Paper award.
   Bederson, B.B., Hu, C., & Resnik, P. Translation by Iterative Collaboration between Monolingual Users, Proceedings of Graphics
    Interface (GI 2010), 39-46.
   N. Bradburn, S. Sudman, and B. Wansink. Asking Questions: The Definitive Guide to Questionnaire Design, Jossey-Bass, 2004.
   C. Callison-Burch. “Fast, Cheap, and Creative: Evaluating Translation Quality Using Amazon’s Mechanical Turk”, EMNLP 2009.
   P. Dai, Mausam, and D. Weld. “Decision-Theoretic Control of Crowd-Sourced Workflows”, AAAI, 2010.
   J. Davis et al. “The HPU”, IEEE Computer Vision and Pattern Recognition Workshop on Advancing Computer Vision with Human
    in the Loop (ACVHL), June 2010.
   M. Gashler, C. Giraud-Carrier, T. Martinez. Decision Tree Ensemble: Small Heterogeneous Is Better Than Large Homogeneous, ICMLA 2008.
   D. A. Grier. When Computers Were Human. Princeton University Press, 2005. ISBN 0691091579
   S. Hacker and L. von Ahn. “Matchin: Eliciting User Preferences with an Online Game”, CHI 2009.
   J. Heer, M. Bostock. “Crowdsourcing Graphical Perception: Using Mechanical Turk to Assess Visualization Design”, CHI 2010.
   P. Heymann and H. Garcia-Molina. “Human Processing”, Technical Report, Stanford Info Lab, 2010.
   J. Howe. “Crowdsourcing: Why the Power of the Crowd Is Driving the Future of Business”. Crown Business, New York, 2008.
   P. Hsueh, P. Melville, V. Sindhwani. “Data Quality from Crowdsourcing: A Study of Annotation Selection Criteria”. NAACL HLT
    Workshop on Active Learning and NLP, 2009.
   B. Huberman, D. Romero, and F. Wu. “Crowdsourcing, attention and productivity”. Journal of Information Science, 2009.
   P.G. Ipeirotis. The New Demographics of Mechanical Turk. March 9, 2010. PDF and Spreadsheet.
   P.G. Ipeirotis, R. Chandrasekar and P. Bennett. Report on the human computation workshop. SIGKDD Explorations v11 no 2 pp. 80-83, 2010.
   P.G. Ipeirotis. Analyzing the Amazon Mechanical Turk Marketplace. CeDER-10-04 (Sept. 11, 2010)

     November 1, 2011                         Crowdsourcing for Research and Engineering                                       185
Bibliography (2)
   A. Kittur, E. Chi, and B. Suh. “Crowdsourcing user studies with Mechanical Turk”, SIGCHI 2008.
   Aniket Kittur, Boris Smus, Robert E. Kraut. CrowdForge: Crowdsourcing Complex Work. CHI 2011
   Adriana Kovashka and Matthew Lease. “Human and Machine Detection of … Similarity in Art”. CrowdConf 2010.
   K. Krippendorff. "Content Analysis", Sage Publications, 2003
   G. Little, L. Chilton, M. Goldman, and R. Miller. “TurKit: Tools for Iterative Tasks on Mechanical Turk”, HCOMP 2009.
   T. Malone, R. Laubacher, and C. Dellarocas. Harnessing Crowds: Mapping the Genome of Collective Intelligence.
    2009.
   W. Mason and D. Watts. “Financial Incentives and the ’Performance of Crowds’”, HCOMP Workshop at KDD 2009.
   J. Nielsen. “Usability Engineering”, Morgan-Kaufman, 1994.
   A. Quinn and B. Bederson. “A Taxonomy of Distributed Human Computation”, Technical Report HCIL-2009-23, 2009
   J. Ross, L. Irani, M. Six Silberman, A. Zaldivar, and B. Tomlinson. “Who are the Crowdworkers?: Shifting
    Demographics in Amazon Mechanical Turk”. CHI 2010.
   F. Scheuren. “What is a Survey” (http://www.whatisasurvey.info) 2004.
   R. Snow, B. O’Connor, D. Jurafsky, and A. Y. Ng. “Cheap and Fast But is it Good? Evaluating Non-Expert Annotations
    for Natural Language Tasks”. EMNLP-2008.
   V. Sheng, F. Provost, P. Ipeirotis. “Get Another Label? Improving Data Quality … Using Multiple, Noisy Labelers”
    KDD 2008.
   S. Weber. “The Success of Open Source”, Harvard University Press, 2004.
   L. von Ahn. Games with a purpose. Computer, 39 (6), 92–94, 2006.
   L. von Ahn and L. Dabbish. “Designing Games with a purpose”. CACM, Vol. 51, No. 8, 2008.

November 1, 2011                      Crowdsourcing for Research and Engineering                                    186
Bibliography (3)
   Shuo Chen et al. What if the Irresponsible Teachers Are Dominating? A Method of Training on Samples and
    Clustering on Teachers. AAAI 2010.
   Paul Heymann, Hector Garcia-Molina: Turkalytics: analytics for human computation. WWW 2011.
   Florian Laws, Christian Scheible and Hinrich Schütze. Active Learning with Amazon Mechanical Turk.
    EMNLP 2011.
   C.Y. Lin. Rouge: A package for automatic evaluation of summaries. Proceedings of the workshop on text
    summarization branches out (WAS), 2004.
   C. Marshall and F. Shipman “The Ownership and Reuse of Visual Media”, JCDL, 2011.
   Hohyon Ryu and Matthew Lease. Crowdworker Filtering with Support Vector Machine. ASIS&T 2011.
   Wei Tang and Matthew Lease. Semi-Supervised Consensus Labeling for Crowdsourcing. ACM SIGIR
    Workshop on Crowdsourcing for Information Retrieval (CIR), 2011.
   S. Vijayanarasimhan and K. Grauman. Large-Scale Live Active Learning: Training Object Detectors with
    Crawled Data and Crowds. CVPR 2011.
   Stephen Wolfson and Matthew Lease. Look Before You Leap: Legal Pitfalls of Crowdsourcing. ASIS&T 2011.




November 1, 2011                    Crowdsourcing for Research and Engineering                                187
Crowdsourcing in IR: 2008-2010
   2008
          O. Alonso, D. Rose, and B. Stewart. “Crowdsourcing for relevance evaluation”, SIGIR Forum, Vol. 42, No. 2.

   2009
          O. Alonso and S. Mizzaro. “Can we get rid of TREC Assessors? Using Mechanical Turk for … Assessment”. SIGIR Workshop on the Future of IR Evaluation.
          P.N. Bennett, D.M. Chickering, A. Mityagin. Learning Consensus Opinion: Mining Data from a Labeling Game. WWW.
          G. Kazai, N. Milic-Frayling, and J. Costello. “Towards Methods for the Collective Gathering and Quality Control of Relevance Assessments”, SIGIR.
          G. Kazai and N. Milic-Frayling. “… Quality of Relevance Assessments Collected through Crowdsourcing”. SIGIR Workshop on the Future of IR Evaluation.
          Law et al. “SearchWar”. HCOMP.
          H. Ma, R. Chandrasekar, C. Quirk, and A. Gupta. “Improving Search Engines Using Human Computation Games”, CIKM 2009.

   2010
          SIGIR Workshop on Crowdsourcing for Search Evaluation.
          O. Alonso, R. Schenkel, and M. Theobald. “Crowdsourcing Assessments for XML Ranked Retrieval”, ECIR.
          K. Berberich, S. Bedathur, O. Alonso, G. Weikum “A Language Modeling Approach for Temporal Information Needs”, ECIR.
          C. Grady and M. Lease. “Crowdsourcing Document Relevance Assessment with Mechanical Turk”. NAACL HLT Workshop on … Amazon's Mechanical Turk.
           Grace Hui Yang, Anton Mityagin, Krysta M. Svore, and Sergey Markov. “Collecting High Quality Overlapping Labels at Low Cost”. SIGIR.
          G. Kazai. “An Exploration of the Influence that Task Parameters Have on the Performance of Crowds”. CrowdConf.
          G. Kazai. “… Crowdsourcing in Building an Evaluation Platform for Searching Collections of Digitized Books”., Workshop on Very Large Digital Libraries (VLDL)
          Stephanie Nowak and Stefan Ruger. How Reliable are Annotations via Crowdsourcing? MIR.
          Jean-François Paiement, Dr. James G. Shanahan, and Remi Zajac. “Crowdsourcing Local Search Relevance”. CrowdConf.
          Maria Stone and Omar Alonso. “A Comparison of On-Demand Workforce with Trained Judges for Web Search Relevance Evaluation”. CrowdConf.
          T. Yan, V. Kumar, and D. Ganesan. CrowdSearch: exploiting crowds for accurate real-time image search on mobile phones. MobiSys pp. 77--90, 2010.




     November 1, 2011                                   Crowdsourcing for Research and Engineering                                                             188
Crowdsourcing in IR: 2011
   WSDM Workshop on Crowdsourcing for Search and Data Mining.
   SIGIR Workshop on Crowdsourcing for Information Retrieval.


   O. Alonso and R. Baeza-Yates. “Design and Implementation of Relevance Assessments using Crowdsourcing, ECIR 2011.
   Roi Blanco, Harry Halpin, Daniel Herzig, Peter Mika, Jeffrey Pound, Henry Thompson, Thanh D. Tran. “Repeatable and
    Reliable Search System Evaluation using Crowd-Sourcing”. SIGIR 2011.
   Yen-Ta Huang, An-Jung Cheng, Liang-Chi Hsieh, Winston H. Hsu, Kuo-Wei Chang. “Region-Based Landmark Discovery by
    Crowdsourcing Geo-Referenced Photos.” SIGIR 2011.
    Hyun Joon Jung, Matthew Lease. “Improving Consensus Accuracy via Z-score and Weighted Voting”. HCOMP 2011.
   G. Kasneci, J. Van Gael, D. Stern, and T. Graepel, CoBayes: Bayesian Knowledge Corroboration with Assessors of
    Unknown Areas of Expertise, WSDM 2011.
    Gabriella Kazai. “In Search of Quality in Crowdsourcing for Search Engine Evaluation”, ECIR 2011.
   Gabriella Kazai, Jaap Kamps, Marijn Koolen, Natasa Milic-Frayling. “Crowdsourcing for Book Search Evaluation: Impact of Quality
    on Comparative System Ranking.” SIGIR 2011.
    Abhimanu Kumar, Matthew Lease. “Learning to Rank From a Noisy Crowd”. SIGIR 2011.
   Edith Law, Paul N. Bennett, and Eric Horvitz. “The Effects of Choice in Routing Relevance Judgments”. SIGIR 2011.




     November 1, 2011                       Crowdsourcing for Research and Engineering                                  189

  • 25. Example: cQ&A, Social Search, & Polling November 1, 2011 Crowdsourcing for Research and Engineering 25
  • 26. Example: cQ&A • Earn Money (real or virtual) • Have fun (or pass the time) • Socialize with others • Obtain recognition or prestige • Do Good (altruism) • Learn something new • Obtain something else • Create self-serving resource November 1, 2011 Crowdsourcing for Research and Engineering 26
  • 27. Example: reCaptcha November 1, 2011 Crowdsourcing for Research and Engineering 27
  • 28. Example: reCaptcha • Earn Money (real or virtual) • Have fun (or pass the time) • Socialize with others • Obtain recognition or prestige • Do Good (altruism) • Learn something new • Obtain something else • Create self-serving resource Is there an existing human activity you can harness for another purpose? November 1, 2011 Crowdsourcing for Research and Engineering 28
  • 29. Example: Mechanical Turk J. Pontin. Artificial Intelligence, With Help From the Humans. New York Times (March 25, 2007) November 1, 2011 Crowdsourcing for Research and Engineering 29
  • 30. Example: Mechanical Turk • Earn Money (real or virtual) • Have fun (or pass the time) • Socialize with others • Obtain recognition or prestige • Do Good (altruism) • Learn something new • Obtain something else • Create self-serving resource November 1, 2011 Crowdsourcing for Research and Engineering 30
  • 31. Look Before You Leap • Wolfson & Lease (2011) • Identify a few potential legal pitfalls to know about when considering crowdsourcing – employment law – patent inventorship – data security and the Federal Trade Commission – copyright ownership – securities regulation of crowdfunding • Take-away: don’t panic, just be mindful of the law November 1, 2011 Crowdsourcing for Research and Engineering 31
  • 32. Example: SamaSource Incentive for YOU: Do Good Terminology: channels November 1, 2011 Crowdsourcing for Research and Engineering 32
  • 33. Who are the workers? • A. Baio, November 2008. The Faces of Mechanical Turk. • P. Ipeirotis. March 2010. The New Demographics of Mechanical Turk • J. Ross, et al. Who are the Crowdworkers?... CHI 2010. November 1, 2011 Crowdsourcing for Research and Engineering 33
  • 34. MTurk Demographics • 2008-2009 studies found less global and diverse than previously thought – US – Female – Educated – Bored – Money is secondary November 1, 2011 Crowdsourcing for Research and Engineering 34
  • 35. 2010 shows increasing diversity 47% US, 34% India, 19% other (P. Ipeirotis. March 2010) November 1, 2011 Crowdsourcing for Research and Engineering 35
  • 36. MICRO-TASKS++ AND EXAMPLES November 1, 2011 Crowdsourcing for Research and Engineering 36
  • 37. Chess machine unveiled in 1770 by Wolfgang von Kempelen (1734–1804) • “Micro-task” crowdsourcing marketplace • On-demand, scalable, real-time workforce • Online since 2005 (and still in “beta”) • Programmer’s API & “Dashboard” GUI • Sponsorship: TREC 2011 Crowdsourcing Track (pending) November 1, 2011 Crowdsourcing for Research and Engineering 37
  • 38. Does anyone really use it? Yes! http://www.mturk-tracker.com (P. Ipeirotis’10) From 1/09 – 4/10, 7M HITs from 10K requestors worth $500,000 USD (significant under-estimate) November 1, 2011 Crowdsourcing for Research and Engineering 38
  • 39. • Labor on-demand, Channels, Quality control features • Sponsorship – Research Workshops: CSE’10, CSDM’11, CIR’11, – TREC 2011 Crowdsourcing Track November 1, 2011 Crowdsourcing for Research and Engineering 39
  • 40. CloudFactory • Information below from Mark Sears (Oct. 18, 2011) • Cloud Labor API – Tools to design virtual assembly lines – workflows with multiple tasks chained together • Focus on self-serve tools for people to easily design crowd-powered assembly lines that can be easily integrated into software applications • Interfaces: command-line, RESTful API, and Web • Each “task station” can have either a human or robot worker assigned – web software services (AlchemyAPI, SendGrid, Google APIs, Twilio, etc.) or local software can be combined with human computation • Many built-in "best practices" – “Tournament Stations” where multiple results are compared by other cloud workers until confidence in the best answer is reached – “Improver Stations” have workers improve and correct work by other workers – Badges are earned by cloud workers passing tests created by requesters – Training and tools to create skill tests will be flexible – Algorithms to detect and kick out spammers/cheaters/lazy/bad workers • Sponsorship: TREC 2012 Crowdsourcing Track November 1, 2011 Crowdsourcing for Research and Engineering 40
  • 41. More Crowd Labor Platforms • Clickworker • CloudCrowd • CrowdSource • DoMyStuff • Humanoid (by Matt Swanson et al.) • Microtask • MobileWorks (by Anand Kulkarni) • myGengo • SmartSheet • vWorker • Industry heavy-weights – Elance – Liveops – oDesk – uTest • and more… November 1, 2011 Crowdsourcing for Research and Engineering 41
  • 42. Why Micro-Tasks? • Easy, cheap and fast • Ready-to use infrastructure, e.g. – MTurk payments, workforce, interface widgets – CrowdFlower quality control mechanisms, etc. – Many others … • Allows early, iterative, frequent trials – Iteratively prototype and test new ideas – Try new tasks, test when you want & as you go • Many successful examples of use reported November 1, 2011 Crowdsourcing for Research and Engineering 42
  • 43. Micro-Task Issues • Process – Task design, instructions, setup, iteration • Choose crowdsourcing platform (or roll your own) • Human factors – Payment / incentives, interface and interaction design, communication, reputation, recruitment, retention • Quality Control / Data Quality – Trust, reliability, spam detection, consensus labeling November 1, 2011 Crowdsourcing for Research and Engineering 43
  • 44. Legal Disclaimer: Caution Tape and Silver Bullets • Often still involves more art than science • Not a magic panacea, but another alternative – one more data point for analysis, complements other methods • Quality may be traded off for time/cost/effort • Hard work & experimental design still required! November 1, 2011 Crowdsourcing for Research and Engineering 44
  • 45. Hello World Demo • We’ll show a simple, short demo of MTurk • This is a teaser highlighting things we’ll discuss – Don’t worry about details; we’ll revisit them • Specific task unimportant • Big idea: easy, fast, cheap to label with MTurk! November 1, 2011 Crowdsourcing for Research and Engineering 45
  • 46. Jane saw the man with the binoculars November 1, 2011 Crowdsourcing for Research and Engineering 46
  • 47. DEMO November 1, 2011 Crowdsourcing for Research and Engineering 47
  • 48. Traditional Data Collection • Setup data collection software / harness • Recruit participants • Pay a flat fee for experiment or hourly wage • Characteristics – Slow – Expensive – Tedious – Sample Bias November 1, 2011 Crowdsourcing for Research and Engineering 48
  • 49. Research Using Micro-Tasks • Let’s see examples of micro-task usage – Many areas: IR, NLP, computer vision, user studies, usability testing, psychological studies, surveys, … • Check bibliography at end for more references November 1, 2011 Crowdsourcing for Research and Engineering 49
  • 50. NLP Example – Dialect Identification November 1, 2011 Crowdsourcing for Research and Engineering 50
  • 51. NLP Example – Spelling correction November 1, 2011 Crowdsourcing for Research and Engineering 51
  • 52. NLP Example – Machine Translation • Manual evaluation of translation quality is slow and expensive • High agreement between non-experts and experts • $0.10 to translate a sentence C. Callison-Burch. “Fast, Cheap, and Creative: Evaluating Translation Quality Using Amazon’s Mechanical Turk”, EMNLP 2009. B. Bederson et al. Translation by Iterative Collaboration between Monolingual Users, GI 2010 November 1, 2011 Crowdsourcing for Research and Engineering 52
  • 53. NLP Example – Snow et al. (2008) • 5 Tasks – Affect recognition – Word similarity – Recognizing textual entailment – Event temporal ordering – Word sense disambiguation • high agreement between crowd labels and expert “gold” labels – assumes training data for worker bias correction • 22K labels for $26 ! November 1, 2011 Crowdsourcing for Research and Engineering 53
  • 54. Computer Vision – Painting Similarity Kovashka & Lease, CrowdConf’10 November 1, 2011 Crowdsourcing for Research and Engineering 54
  • 55. User Studies • Investigate attitudes about saving, sharing, publishing, and removing online photos • Survey – A scenario-based probe of respondent attitudes, designed to yield quantitative data – A set of questions (closed and open-ended) – Importance of recent activity – 41 questions – 7-point scale • 250 respondents C. Marshall and F. Shipman. “The Ownership and Reuse of Visual Media”, JCDL 2011. November 1, 2011 Crowdsourcing for Research and Engineering 55
  • 56. Remote Usability Testing • Liu et al. (in preparation) • Compares remote usability testing using MTurk and CrowdFlower (not uTest) vs. traditional on-site testing • Advantages – More Participants – More Diverse Participants – High Speed – Low Cost • Disadvantages – Lower Quality Feedback – Less Interaction – Greater need for quality control – Less Focused User Groups November 1, 2011 Crowdsourcing for Research and Engineering 56
  • 57. IR Example – Relevance and ads November 1, 2011 Crowdsourcing for Research and Engineering 57
  • 58. IR Example – Product Search November 1, 2011 Crowdsourcing for Research and Engineering 58
  • 59. IR Example – Snippet Evaluation • Study on summary lengths • Determine preferred result length • Asked workers to categorize web queries • Asked workers to evaluate snippet quality • Payment between $0.01 and $0.05 per HIT M. Kaisser, M. Hearst, and L. Lowe. “Improving Search Results Quality by Customizing Summary Lengths”, ACL/HLT, 2008. November 1, 2011 Crowdsourcing for Research and Engineering 59
  • 60. IR Example – Relevance Assessment • Replace TREC-like relevance assessors with MTurk? • Selected topic “space program” (011) • Modified original 4-page instructions from TREC • Workers more accurate than original assessors! • 40% provided justification for each answer O. Alonso and S. Mizzaro. “Can we get rid of TREC assessors? Using Mechanical Turk for relevance assessment”, SIGIR Workshop on the Future of IR Evaluation, 2009. November 1, 2011 Crowdsourcing for Research and Engineering 60
  • 61. IR Example – Timeline Annotation • Workers annotate timeline on politics, sports, culture • Given a timex (1970s, 1982, etc.) suggest something • Given an event (Vietnam, World cup, etc.) suggest a timex K. Berberich, S. Bedathur, O. Alonso, G. Weikum “A Language Modeling Approach for Temporal Information Needs”. ECIR 2010 November 1, 2011 Crowdsourcing for Research and Engineering 61
  • 62. How can I get started? • You have an idea • Easy, cheap, fast, and iterative sounds good • Can you test your idea via crowdsourcing? • Is my idea crowdsourcable? • How do I start? • What do I need? November 1, 2011 Crowdsourcing for Research and Engineering 62
  • 63. Tip for Getting Started: do work Try doing work before you create work for others! November 1, 2011 Crowdsourcing for Research and Engineering 63
  • 64. II AMAZON MECHANICAL TURK November 1, 2011 Crowdsourcing for Research and Engineering 64
  • 65. Mechanical What? November 1, 2011 Crowdsourcing for Research and Engineering 65
  • 66. MTurk: The Requester • Sign up with your Amazon account • Amazon payments • Purchase prepaid HITs • There is no minimum or up-front fee • MTurk collects a 10% commission • The minimum commission charge is $0.005 per HIT November 1, 2011 Crowdsourcing for Research and Engineering 66
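Using the fee figures on this slide (10% commission with a $0.005 minimum), a quick back-of-the-envelope cost estimate might look like the sketch below. It treats the minimum as applying per assignment, which is a simplified reading of the slide; current MTurk fees differ from these 2011 numbers.

```python
# Back-of-the-envelope batch cost using the 2011 fee figures on this slide
# (10% commission, $0.005 minimum); the minimum is treated as per assignment.

def batch_cost(n_hits, workers_per_hit, reward,
               commission_rate=0.10, min_commission=0.005):
    assignments = n_hits * workers_per_hit
    rewards = assignments * reward
    commission = assignments * max(commission_rate * reward, min_commission)
    return rewards + commission

# 200 queries, 3 workers each, $0.02 per judgment:
# $12.00 in rewards + $3.00 commission (the minimum dominates) = $15.00
print(batch_cost(200, 3, 0.02))
```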
  • 67. MTurk Dashboard • Three tabs – Design – Publish – Manage • Design – HIT Template • Publish – Make work available • Manage – Monitor progress November 1, 2011 Crowdsourcing for Research and Engineering 67
  • 68. MTurk: Dashboard - II November 1, 2011 Crowdsourcing for Research and Engineering 68
  • 69. MTurk API • Amazon Web Services API • Rich set of services • Command line tools • More flexibility than dashboard November 1, 2011 Crowdsourcing for Research and Engineering 69
  • 70. MTurk Dashboard vs. API • Dashboard – Easy to prototype – Setup and launch an experiment in a few minutes • API – Ability to integrate AMT as part of a system – Ideal if you want to run experiments regularly – Schedule tasks November 1, 2011 Crowdsourcing for Research and Engineering 70
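To make the dashboard-vs-API contrast concrete, here is a minimal sketch of an API-driven batch. The `MTurkClient` object and its `create_hit`, `is_complete`, and `fetch_results` methods are hypothetical placeholders, not the actual AWS API; libraries such as Boto expose comparable operations. The point is simply that code can publish, poll, and collect results on a schedule, which the dashboard cannot do for you.

```python
# Hypothetical sketch only: MTurkClient, create_hit, is_complete, and
# fetch_results are placeholder names standing in for a real MTurk client.

import time

def run_batch(client, template, items, workers_per_item=3, poll_seconds=60):
    hit_ids = [client.create_hit(template, item, assignments=workers_per_item)
               for item in items]                      # publish all HITs
    results = {}
    while len(results) < len(hit_ids):                 # poll until everything is done
        for hit_id in hit_ids:
            if hit_id not in results and client.is_complete(hit_id):
                results[hit_id] = client.fetch_results(hit_id)
        time.sleep(poll_seconds)
    return results                                     # hand off to quality control
```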
  • 71. Working on MTurk • Sign up with your Amazon account • Tabs – Account: work approved/rejected – HIT: browse and search for work – Qualifications: browse & search qualifications • Start turking! November 1, 2011 Crowdsourcing for Research and Engineering 71
  • 72. Why Eytan Adar hates MTurk Research (at least sort of) • Overly-narrow focus on Turk & other platforms – Identify general vs. platform-specific problems – Academic vs. Industrial problems • Lack of appreciation of interdisciplinary nature – Some problems well-studied in other areas – Human behavior hasn’t changed much • Turks aren’t Martians – How many prior user studies do we have to reproduce on MTurk before we can get over it? November 1, 2011 Crowdsourcing for Research and Engineering 72
  • 73. III RELEVANCE JUDGING & CROWDSOURCING November 1, 2011 Crowdsourcing for Research and Engineering 73
  • 74. November 1, 2011 Crowdsourcing for Research and Engineering 74
  • 75. Motivating Example: Relevance Judging • Relevance of search results is difficult to judge – Highly subjective – Expensive to measure • Professional editors commonly used • Potential benefits of crowdsourcing – Scalability (time and cost) – Diversity of judgments November 1, 2011 Crowdsourcing for Research and Engineering 75
  • 76. November 1, 2011 Crowdsourcing for Research and Engineering 76
  • 77. Started with a joke … November 1, 2011 Crowdsourcing for Research and Engineering 77
  • 78. Results for {idiot} at WSDM 2011 February 2011: 5/7 (R), 2/7 (NR) – Most of the time those TV reality stars have absolutely no talent. They do whatever they can to make a quick dollar. Most of the time the reality tv stars don not have a mind of their own. R – Most are just celebrity wannabees. Many have little or no talent, they just want fame. R – I can see this one going both ways. A particular sort of reality star comes to mind, though, one who was voted off Survivor because he chose not to use his immunity necklace. Sometimes the label fits, but sometimes it might be unfair. R – Just because someone else thinks they are an "idiot", doesn't mean that is what the word means. I don't like to think that any one person's photo would be used to describe a certain term. NR – While some reality-television stars are genuinely stupid (or cultivate an image of stupidity), that does not mean they can or should be classified as "idiots." Some simply act that way to increase their TV exposure and potential earnings. Other reality-television stars are really intelligent people, and may be considered as idiots by people who don't like them or agree with them. It is too subjective an issue to be a good result for a search engine. NR – Have you seen the knuckledraggers on reality television? They should be required to change their names to idiot after appearing on the show. You could put numbers after the word idiot so we can tell them apart. R – Although I have not followed too many of these shows, those that I have encountered have for a great part a very common property. That property is that most of the participants involved exhibit a shallow self-serving personality that borders on social pathological behavior. To perform or act in such an abysmal way could only be an act of an idiot. R November 1, 2011 Crowdsourcing for Research and Engineering 78
  • 79. Two Simple Examples of MTurk 1. Ask workers to classify a query 2. Ask workers to judge document relevance Steps • Define high-level task • Design & implement interface & backend • Launch, monitor progress, and assess work • Iterate design November 1, 2011 Crowdsourcing for Research and Engineering 79
  • 80. Query Classification Task • Ask the user to classify a query • Show a form that contains a few categories • Upload a few queries (~20) • Use 3 workers November 1, 2011 Crowdsourcing for Research and Engineering 80
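One concrete way to prepare this task: MTurk's template interface fills a ${column_name} placeholder from each column of an uploaded CSV, so the batch input is just a CSV of queries. The sketch below assumes the HIT template uses a ${query} placeholder; the four queries are made-up examples.

```python
# Write the batch input file for the query-classification template.

import csv

queries = ["britney spears", "used car prices", "jaguar", "how to tie a tie"]

with open("query_batch.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["query"])       # header must match the template placeholder
    for q in queries:
        writer.writerow([q])
# Upload query_batch.csv when publishing, requesting 3 assignments per HIT.
```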
  • 81. DEMO November 1, 2011 Crowdsourcing for Research and Engineering 81
  • 82. November 1, 2011 Crowdsourcing for Research and Engineering 82
  • 83. Relevance Judging Task • Use a few documents from a standard collection used for evaluating search engines • Ask user to make binary judgments • Modification: graded judging • Use 5 workers November 1, 2011 Crowdsourcing for Research and Engineering 83
  • 84. DEMO November 1, 2011 Crowdsourcing for Research and Engineering 84
  • 85. IV METHODOLOGY FOR EFFECTIVE CROWDSOURCING November 1, 2011 Crowdsourcing for Research and Engineering 85
  • 86. November 1, 2011 Crowdsourcing for Research and Engineering 86
  • 87. Typical Workflow • Define and design what to test • Sample data • Design the experiment • Run experiment • Collect data and analyze results • Quality control November 1, 2011 Crowdsourcing for Research and Engineering 87
  • 88. Development Framework • Incremental approach • Measure, evaluate, and adjust as you go • Suitable for repeatable tasks November 1, 2011 Crowdsourcing for Research and Engineering 88
  • 89. Survey Design • One of the most important parts • Part art, part science • Instructions are key • Prepare to iterate November 1, 2011 Crowdsourcing for Research and Engineering 89
  • 90. Questionnaire Design • Ask the right questions • Workers may not be IR experts so don’t assume the same understanding in terms of terminology • Show examples • Hire a technical writer – Engineer writes the specification – Writer communicates November 1, 2011 Crowdsourcing for Research and Engineering 90
  • 91. UX Design • Time to apply all those usability concepts • Generic tips – Experiment should be self-contained. – Keep it short and simple. Brief and concise. – Be very clear with the relevance task. – Engage with the worker. Avoid boring stuff. – Always ask for feedback (open-ended question) in an input box. November 1, 2011 Crowdsourcing for Research and Engineering 91
  • 92. UX Design - II • Presentation • Document design • Highlight important concepts • Colors and fonts • Need to grab attention • Localization November 1, 2011 Crowdsourcing for Research and Engineering 92
  • 93. Examples - I • Asking too much, task not clear, “do NOT/reject” • Worker has to do a lot of stuff November 1, 2011 Crowdsourcing for Research and Engineering 93
  • 94. Example - II • Lot of work for a few cents • Go here, go there, copy, enter, count … November 1, 2011 Crowdsourcing for Research and Engineering 94
  • 95. A Better Example • All information is available – What to do – Search result – Question to answer November 1, 2011 Crowdsourcing for Research and Engineering 95
  • 96. November 1, 2011 Crowdsourcing for Research and Engineering 96
  • 97. Form and Metadata • Form with a closed question (binary relevance) and an open-ended question (user feedback) • Clear title, useful keywords • Workers need to find your task November 1, 2011 Crowdsourcing for Research and Engineering 97
  • 98. Relevance Judging – Example I November 1, 2011 Crowdsourcing for Research and Engineering 98
  • 99. Relevance Judging – Example II November 1, 2011 Crowdsourcing for Research and Engineering 99
  • 100. Implementation • Similar to a UX • Build a mock up and test it with your team – Yes, you need to judge some tasks • Incorporate feedback and run a test on MTurk with a very small data set – Time the experiment – Do people understand the task? • Analyze results – Look for spammers – Check completion times • Iterate and modify accordingly November 1, 2011 Crowdsourcing for Research and Engineering 100
  • 101. Implementation – II • Introduce quality control – Qualification test – Gold answers (honey pots) • Adjust passing grade and worker approval rate • Run experiment with new settings & same data • Scale on data • Scale on workers November 1, 2011 Crowdsourcing for Research and Engineering 101
  • 102. Experiment in Production • Lots of tasks on MTurk at any moment • Need to grab attention • Importance of experiment metadata • When to schedule – Split a large task into batches and have 1 single batch in the system – Always review feedback from batch n before uploading n+1 November 1, 2011 Crowdsourcing for Research and Engineering 102
  • 103. How Much to Pay? • Price commensurate with task effort – Ex: $0.02 for yes/no answer + $0.02 bonus for optional feedback • Ethics & market-factors: W. Mason and S. Suri, 2010. – e.g. non-profit SamaSource contracts workers in refugee camps – Predict right price given market & task: Wang et al. CSDM’11 • Uptake & time-to-completion vs. Cost & Quality – Too little $$: no interest or slow – Too much $$: attracts spammers – Real problem is lack of reliable QA substrate • Accuracy & quantity – More pay = more work, not better (W. Mason and D. Watts, 2009) • Heuristics: start small, watch uptake and bargaining feedback • Worker retention (“anchoring”) See also: L.B. Chilton et al. KDD-HCOMP 2010. November 1, 2011 Crowdsourcing for Research and Engineering 103
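One simple way to operationalize "price commensurate with task effort" is to work back from an estimated completion time and a target effective hourly rate, then round to whole cents. The numbers below are illustrative only, not recommendations.

```python
# Derive a starting per-HIT reward from task time and a target hourly rate.

def reward_per_hit(seconds_per_hit, target_hourly_rate):
    raw = target_hourly_rate * seconds_per_hit / 3600.0
    return max(0.01, round(raw, 2))   # rewards are in whole cents, $0.01 minimum

print(reward_per_hit(20, 3.00))       # a 20-second judgment at $3/hour -> 0.02
```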
  • 104. November 1, 2011 Crowdsourcing for Research and Engineering 104
  • 105. Quality Control in General • Extremely important part of the experiment • Approach as “overall” quality; not just for workers • Bi-directional channel – You may think the worker is doing a bad job. – The same worker may think you are a lousy requester. November 1, 2011 Crowdsourcing for Research and Engineering 105
  • 106. When to assess quality of work • Beforehand (prior to main task activity) – How: “qualification tests” or similar mechanism – Purpose: screening, selection, recruiting, training • During – How: assess labels as worker produces them • Like random checks on a manufacturing line – Purpose: calibrate, reward/penalize, weight • After – How: compute accuracy metrics post-hoc – Purpose: filter, calibrate, weight, retain (HR) – E.g. Jung & Lease (2011), Tang & Lease (2011), ... November 1, 2011 Crowdsourcing for Research and Engineering 106
  • 107. How to assess quality of work? • Compare worker’s label vs. – Known (correct, trusted) label – Other workers’ labels • P. Ipeirotis. Worker Evaluation in Crowdsourcing: Gold Data or Multiple Workers? Sept. 2010. – Model predictions of the above • Model the labels (Ryu & Lease, ASIS&T11) • Model the workers (Chen et al., AAAI’10) • Verify worker’s label – Yourself – Tiered approach (e.g. Find-Fix-Verify) • Quinn and B. Bederson’09, Bernstein et al.’10 November 1, 2011 Crowdsourcing for Research and Engineering 107
  • 108. Typical Assumptions • Objective truth exists – no minority voice / rare insights – Can relax this to model “truth distribution” • Automatic answer comparison/evaluation – What about free text responses? Hope from NLP… • Automatic essay scoring • Translation (BLEU: Papineni, ACL’2002) • Summarization (Rouge: C.Y. Lin, WAS’2004) – Have people do it (yourself or find-verify crowd, etc.) November 1, 2011 Crowdsourcing for Research and Engineering 108
  • 109. Distinguishing Bias vs. Noise • Ipeirotis (HComp 2010) • People often have consistent, idiosyncratic skews in their labels (bias) – E.g. I like action movies, so they get higher ratings • Once detected, systematic bias can be calibrated for and corrected (yeah!) • Noise, however, seems random & inconsistent – this is the real issue we want to focus on November 1, 2011 Crowdsourcing for Research and Engineering 109
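A minimal sketch of the kind of calibration this slide has in mind, assuming graded judgments (e.g. 1-5 relevance) and a small set of gold items: estimate each worker's average offset on gold and subtract it from their other labels. Random noise would not be corrected by this.

```python
# Calibration sketch for consistent bias on graded labels.
# gold: {item: true_grade}; worker_labels: {worker: {item: grade}}.

def bias_offsets(gold, worker_labels):
    offsets = {}
    for worker, labels in worker_labels.items():
        diffs = [g - gold[item] for item, g in labels.items() if item in gold]
        offsets[worker] = sum(diffs) / len(diffs) if diffs else 0.0
    return offsets

def debias(grade, worker, offsets):
    # subtract the worker's estimated systematic offset; noise is untouched
    return grade - offsets.get(worker, 0.0)
```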
  • 110. Comparing to known answers • AKA: gold, honey pot, verifiable answer, trap • Assumes you have known answers • Cost vs. Benefit – Producing known answers (experts?) – % of work spent re-producing them • Finer points – Controls against collusion – What if workers recognize the honey pots? November 1, 2011 Crowdsourcing for Research and Engineering 110
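A small sketch of the honey-pot check, assuming each worker's labels and the gold answers are available as dictionaries: score workers only on the trap items and flag anyone below a chosen accuracy threshold (the 0.7 cutoff is arbitrary).

```python
# Honey-pot check: score each worker only on items with known answers.
# gold: {item: answer}; worker_labels: {worker: {item: answer}}.

def flag_workers(gold, worker_labels, min_accuracy=0.7):
    flagged = []
    for worker, labels in worker_labels.items():
        trapped = [item for item in labels if item in gold]
        if not trapped:
            continue                      # this worker never saw a honey pot
        correct = sum(labels[item] == gold[item] for item in trapped)
        if correct / len(trapped) < min_accuracy:
            flagged.append(worker)
    return flagged
```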
  • 111. Comparing to other workers • AKA: consensus, plurality, redundant labeling • Well-known metrics for measuring agreement • Cost vs. Benefit: % of work that is redundant • Finer points – Is consensus “truth” or systematic bias of group? – What if no one really knows what they’re doing? • Low-agreement across workers indicates problem is with the task (or a specific example), not the workers – Risk of collusion • Sheng et al. (KDD 2008) November 1, 2011 Crowdsourcing for Research and Engineering 111
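Before reaching for the chance-corrected statistics on the next slides, raw observed agreement between two workers on the items they both judged is a quick first look; a minimal version follows.

```python
# Observed (percentage) agreement between two workers on shared items.
# labels_a, labels_b: {item: label} for each worker.

def percent_agreement(labels_a, labels_b):
    shared = set(labels_a) & set(labels_b)
    if not shared:
        return None                       # nothing judged in common
    return sum(labels_a[i] == labels_b[i] for i in shared) / len(shared)
```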
  • 112. Comparing to predicted label • Ryu & Lease, ASIS&T11 (CrowdConf’11 poster) • Catch-22 extremes – If model is really bad, why bother comparing? – If model is really good, why collect human labels? • Exploit model confidence – Trust predictions proportional to confidence – What if model very confident and wrong? • Active learning – Time sensitive: Accuracy / confidence changes November 1, 2011 Crowdsourcing for Research and Engineering 112
  • 113. Compare to predicted worker labels • Chen et al., AAAI’10 • Avoid inefficiency of redundant labeling – See also: Dekel & Shamir (COLT’2009) • Train a classifier for each worker • For each example labeled by a worker – Compare to predicted labels for all other workers • Issues • Sparsity: workers have to stick around to train model… • Time-sensitivity: New workers & incremental updates? November 1, 2011 Crowdsourcing for Research and Engineering 113
  • 114. Methods for measuring agreement • What to look for – Agreement, reliability, validity • Inter-agreement level – Agreement between judges – Agreement between judges and the gold set • Some statistics – Percentage agreement – Cohen’s kappa (2 raters) – Fleiss’ kappa (any number of raters) – Krippendorff’s alpha • With majority vote, what if 2 say relevant, 3 say not? – Use expert to break ties (Kochhar et al, HCOMP’10; GQR) – Collect more judgments as needed to reduce uncertainty November 1, 2011 Crowdsourcing for Research and Engineering 114
  • 115. Inter-rater reliability • Lots of research • Statistics books cover most of the material • Three categories based on the goals – Consensus estimates – Consistency estimates – Measurement estimates November 1, 2011 Crowdsourcing for Research and Engineering 115
  • 116. Sample code – R packages psy and irr
> library(psy)
> library(irr)
> my_data <- read.delim(file="test.txt", header=TRUE, sep="\t")
> kappam.fleiss(my_data, exact=FALSE)   # Fleiss' kappa (any number of raters)
> my_data2 <- read.delim(file="test2.txt", header=TRUE, sep="\t")
> ckappa(my_data2)                      # Cohen's kappa (2 raters)
November 1, 2011 Crowdsourcing for Research and Engineering 116
  • 117. k coefficient • Different interpretations of k • For practical purposes you need to be >= moderate • Results may vary
k             Interpretation
< 0           Poor agreement
0.01 – 0.20   Slight agreement
0.21 – 0.40   Fair agreement
0.41 – 0.60   Moderate agreement
0.61 – 0.80   Substantial agreement
0.81 – 1.00   Almost perfect agreement
November 1, 2011 Crowdsourcing for Research and Engineering 117
  • 118. Detection Theory • Sensitivity measures – High sensitivity: good ability to discriminate – Low sensitivity: poor ability
Stimulus class   “Yes”          “No”
S2 (signal)      Hits           Misses
S1 (noise)       False alarms   Correct rejections
Hit rate H = P(“yes”|S2), False alarm rate F = P(“yes”|S1)
November 1, 2011 Crowdsourcing for Research and Engineering 118
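A worked example of the rates above, plus d′ (a standard sensitivity index from detection theory that the slide alludes to but does not name), computed with the inverse normal CDF as d′ = z(H) − z(F).

```python
# Hit rate, false alarm rate, and d' = z(H) - z(F).

from statistics import NormalDist

def rates(hits, misses, false_alarms, correct_rejections):
    H = hits / (hits + misses)                               # P("yes" | S2, signal)
    F = false_alarms / (false_alarms + correct_rejections)   # P("yes" | S1, noise)
    return H, F

def d_prime(H, F):
    z = NormalDist().inv_cdf
    return z(H) - z(F)

H, F = rates(hits=80, misses=20, false_alarms=30, correct_rejections=70)
print(H, F, d_prime(H, F))   # 0.8, 0.3 -> d' of about 1.37
```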
  • 119. November 1, 2011 Crowdsourcing for Research and Engineering 119
  • 120. Finding Consensus • When multiple workers disagree on the correct label, how do we resolve this? – Simple majority vote (or average and round) – Weighted majority vote (e.g. naive bayes) • Many papers from machine learning… • If wide disagreement, likely there is a bigger problem which consensus doesn’t address November 1, 2011 Crowdsourcing for Research and Engineering 120
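A minimal consensus sketch covering both rules above: plain majority vote, and a weighted vote in which each worker's vote counts in proportion to some accuracy estimate (e.g. from honey pots). This is a simplification rather than the full naive Bayes formulation; ties are returned unresolved so they can be escalated to an expert or to extra judgments.

```python
# Majority vote with optional per-worker weights; ties return None.

from collections import defaultdict

def weighted_vote(labels, weights=None):
    """labels: {worker: label}; weights: {worker: weight}, default 1.0 each."""
    totals = defaultdict(float)
    for worker, label in labels.items():
        totals[label] += (weights or {}).get(worker, 1.0)
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None                       # tie: collect more labels or ask an expert
    return ranked[0][0]

print(weighted_vote({"w1": "R", "w2": "R", "w3": "NR"}))                  # R
print(weighted_vote({"w1": "R", "w2": "NR"}, {"w1": 0.9, "w2": 0.6}))     # R
```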
  • 121. Quality Control on MTurk • Rejecting work & Blocking workers (more later…) – Requestors don’t want bad PR or complaint emails – Common practice: always pay, block as needed • Approval rate: easy to use, but value? – P. Ipeirotis. Be a Top Mechanical Turk Worker: You Need $5 and 5 Minutes. Oct. 2010 – Many requestors don’t ever reject… • Qualification test – Pre-screen workers’ capabilities & effectiveness – Example and pros/cons in next slides… • Geographic restrictions • Mechanical Turk Masters (June 23, 2011) – Recent addition, degree of benefit TBD… November 1, 2011 Crowdsourcing for Research and Engineering 121
  • 122. Tools and Packages for MTurk • QA infrastructure layers atop MTurk promote useful separation-of-concerns from task – TurkIt • Quik Turkit provides nearly realtime services – Turkit-online (??) – Get Another Label (& qmturk) – Turk Surveyor – cv-web-annotation-toolkit (image labeling) – Soylent – Boto (python library) • Turkpipe: submit batches of jobs using the command line. • More needed… November 1, 2011 Crowdsourcing for Research and Engineering 122
  • 123. A qualification test snippet
<Question>
  <QuestionIdentifier>question1</QuestionIdentifier>
  <QuestionContent>
    <Text>Carbon monoxide poisoning is</Text>
  </QuestionContent>
  <AnswerSpecification>
    <SelectionAnswer>
      <StyleSuggestion>radiobutton</StyleSuggestion>
      <Selections>
        <Selection>
          <SelectionIdentifier>1</SelectionIdentifier>
          <Text>A chemical technique</Text>
        </Selection>
        <Selection>
          <SelectionIdentifier>2</SelectionIdentifier>
          <Text>A green energy treatment</Text>
        </Selection>
        <Selection>
          <SelectionIdentifier>3</SelectionIdentifier>
          <Text>A phenomenon associated with sports</Text>
        </Selection>
        <Selection>
          <SelectionIdentifier>4</SelectionIdentifier>
          <Text>None of the above</Text>
        </Selection>
      </Selections>
    </SelectionAnswer>
  </AnswerSpecification>
</Question>
November 1, 2011 Crowdsourcing for Research and Engineering 123
  • 124. Qualification tests: pros and cons • Advantages – Great tool for controlling quality – Adjust passing grade • Disadvantages – Extra cost to design and implement the test – May turn off workers, hurt completion time – Refresh the test on a regular basis – Hard to verify subjective tasks like judging relevance • Try creating task-related questions to get worker familiar with task before starting task in earnest November 1, 2011 Crowdsourcing for Research and Engineering 124
  • 125. More on quality control & assurance • HR issues: recruiting, selection, & retention – e.g., post/tweet, design a better qualification test, bonuses, … • Collect more redundant judgments… – at some point defeats cost savings of crowdsourcing – 5 workers is often sufficient November 1, 2011 Crowdsourcing for Research and Engineering 125
  • 126. Robots and Captchas • Some reports of robots on MTurk – E.g. McCreadie et al. (2011) – violation of terms of service – Artificial artificial artificial intelligence • Captchas seem ideal, but… – There is abuse of robots using turkers to solve captchas so they can access web resources – Turker wisdom is therefore to avoid such HITs • What to do? – Use standard captchas, notify workers – Block robots other ways (e.g. external HITs) – Catch robots through standard QC, response times – Use HIT-specific captchas (Kazai et al., 2011) November 1, 2011 Crowdsourcing for Research and Engineering 126
  • 127. Was the task difficult? • Ask workers to rate difficulty of a search topic • 50 topics; 5 workers, $0.01 per task November 1, 2011 Crowdsourcing for Research and Engineering 127
  • 128. Other quality heuristics • Justification/feedback as quasi-captcha – Successfully proven in past experiments – Should be optional – Automatically verifying feedback was written by a person may be difficult (classic spam detection task) • Broken URL/incorrect object – Leave an outlier in the data set – Workers will tell you – If somebody answers “excellent” on a graded relevance test for a broken URL => probably spammer November 1, 2011 Crowdsourcing for Research and Engineering 128
  • 129. Dealing with bad workers • Pay for “bad” work instead of rejecting it? – Pro: preserve reputation, admit if poor design at fault – Con: promote fraud, undermine approval rating system • Use bonus as incentive – Pay the minimum $0.01 and $0.01 for bonus – Better than rejecting a $0.02 task • If spammer “caught”, block from future tasks – May be easier to always pay, then block as needed November 1, 2011 Crowdsourcing for Research and Engineering 129
  • 130. Worker feedback • Real feedback received via email after rejection • Worker XXX I did. If you read these articles most of them have nothing to do with space programs. I’m not an idiot. • Worker XXX As far as I remember there wasn't an explanation about what to do when there is no name in the text. I believe I did write a few comments on that, too. So I think you're being unfair rejecting my HITs. November 1, 2011 Crowdsourcing for Research and Engineering 130
  • 131. Real email exchange with worker after rejection WORKER: this is not fair , you made me work for 10 cents and i lost my 30 minutes of time ,power and lot more and gave me 2 rejections at least you may keep it pending. please show some respect to turkers REQUESTER: I'm sorry about the rejection. However, in the directions given in the hit, we have the following instructions: IN ORDER TO GET PAID, you must judge all 5 webpages below *AND* complete a minimum of three HITs. Unfortunately, because you only completed two hits, we had to reject those hits. We do this because we need a certain amount of data on which to make decisions about judgment quality. I'm sorry if this caused any distress. Feel free to contact me if you have any additional questions or concerns. WORKER: I understood the problems. At that time my kid was crying and i went to look after. that's why i responded like that. I was very much worried about a hit being rejected. The real fact is that i haven't seen that instructions of 5 web page and started doing as i do the dolores labs hit, then someone called me and i went to attend that call. sorry for that and thanks for your kind concern. November 1, 2011 Crowdsourcing for Research and Engineering 131
  • 132. Exchange with worker • Worker XXX Thank you. I will post positive feedback for you at Turker Nation. Me: was this a sarcastic comment? • I took a chance by accepting some of your HITs to see if you were a trustworthy author. My experience with you has been favorable so I will put in a good word for you on that website. This will help you get higher quality applicants in the future, which will provide higher quality work, which might be worth more to you, which hopefully means higher HIT amounts in the future. November 1, 2011 Crowdsourcing for Research and Engineering 132
  • 133. Build Your Reputation as a Requestor • Word of mouth effect – Workers trust the requester (pay on time, clear explanation if there is a rejection) – Experiments tend to go faster – Announce forthcoming tasks (e.g. tweet) • Disclose your real identity? November 1, 2011 Crowdsourcing for Research and Engineering 133
  • 134. Other practical tips • Sign up as worker and do some HITs • “Eat your own dog food” • Monitor discussion forums • Address feedback (e.g., poor guidelines, payments, passing grade, etc.) • Everything counts! – Overall design only as strong as weakest link November 1, 2011 Crowdsourcing for Research and Engineering 134
  • 135. Content quality • People like to work on things that they like • TREC ad-hoc vs. INEX – TREC experiments took twice as long to complete – INEX (Wikipedia), TREC (LA Times, FBIS) • Topics – INEX: Olympic games, movies, salad recipes, etc. – TREC: cosmic events, Schengen agreement, etc. • Content and judgments according to modern times – Airport security docs are pre 9/11 – Antarctic exploration (global warming) November 1, 2011 Crowdsourcing for Research and Engineering 135
  • 136. Content quality - II • Document length • Randomize content • Avoid worker fatigue – Judging 100 documents on the same subject can be tiring, leading to decreasing quality November 1, 2011 Crowdsourcing for Research and Engineering 136
  • 137. Presentation • People scan documents for relevance cues • Document design • Highlighting no more than 10% November 1, 2011 Crowdsourcing for Research and Engineering 137
  • 138. Presentation - II November 1, 2011 Crowdsourcing for Research and Engineering 138
  • 139. Relevance justification • Why settle for a label? • Let workers justify answers – cf. Zaidan et al. (2007) “annotator rationales” • INEX – 22% of assignments with comments • Must be optional • Let’s see how people justify November 1, 2011 Crowdsourcing for Research and Engineering 139
  • 140. “Relevant” answers [Salad Recipes] Doesn't mention the word 'salad', but the recipe is one that could be considered a salad, or a salad topping, or a sandwich spread. Egg salad recipe Egg salad recipe is discussed. History of salad cream is discussed. Includes salad recipe It has information about salad recipes. Potato Salad Potato salad recipes are listed. Recipe for a salad dressing. Salad Recipes are discussed. Salad cream is discussed. Salad info and recipe The article contains a salad recipe. The article discusses methods of making potato salad. The recipe is for a dressing for a salad, so the information is somewhat narrow for the topic but is still potentially relevant for a researcher. This article describes a specific salad. Although it does not list a specific recipe, it does contain information relevant to the search topic. gives a recipe for tuna salad relevant for tuna salad recipes relevant to salad recipes this is on-topic for salad recipes November 1, 2011 Crowdsourcing for Research and Engineering 140
  • 141. “Not relevant” answers [Salad Recipes] About gaming not salad recipes. Article is about Norway. Article is about Region Codes. Article is about forests. Article is about geography. Document is about forest and trees. Has nothing to do with salad or recipes. Not a salad recipe Not about recipes Not about salad recipes There is no recipe, just a comment on how salads fit into meal formats. There is nothing mentioned about salads. While dressings should be mentioned with salads, this is an article on one specific type of dressing, no recipe for salads. article about a swiss tv show completely off-topic for salad recipes not a salad recipe not about salad recipes totally off base November 1, 2011 Crowdsourcing for Research and Engineering 141
  • 142. November 1, 2011 Crowdsourcing for Research and Engineering 142
  • 143. Feedback length • Workers will justify answers • Has to be optional for good feedback • In E51, mandatory comments – Length dropped – “Relevant” or “Not Relevant” November 1, 2011 Crowdsourcing for Research and Engineering 143
  • 144. Other design principles • Text alignment • Legibility • Reading level: complexity of words and sentences • Attractiveness (worker’s attention & enjoyment) • Multi-cultural / multi-lingual • Who is the audience (e.g. target worker community) – Special needs communities (e.g. simple color blindness) • Parsimony • Cognitive load: mental rigor needed to perform task • Exposure effect November 1, 2011 Crowdsourcing for Research and Engineering 144
  • 145. Platform alternatives • Why MTurk – Amazon brand, lots of research papers – Speed, price, diversity, payments • Why not – Crowdsourcing != Mturk – Spam, no analytics, must build tools for worker & task quality • How to build your own crowdsourcing platform – Back-end – Template language for creating experiments – Scheduler – Payments? November 1, 2011 Crowdsourcing for Research and Engineering 145
  • 146. The human side • As a worker – I hate when instructions are not clear – I’m not a spammer – I just don’t get what you want – Boring task – A good pay is ideal but not the only condition for engagement • As a requester – Attrition – Balancing act: a task that would produce the right results and is appealing to workers – I want your honest answer for the task – I want qualified workers; system should do some of that for me • Managing crowds and tasks is a daily activity – more difficult than managing computers November 1, 2011 Crowdsourcing for Research and Engineering 146
  • 147. Things that work • Qualification tests • Honey-pots • Good content and good presentation • Economy of attention • Things to improve – Manage workers in different levels of expertise including spammers and potential cases. – Mix different pools of workers based on different profile and expertise levels. November 1, 2011 Crowdsourcing for Research and Engineering 147
  • 148. Things that need work • UX and guidelines – Help the worker – Cost of interaction • Scheduling and refresh rate • Exposure effect • Sometimes we just don’t agree • How crowdsourcable is your task November 1, 2011 Crowdsourcing for Research and Engineering 148
  • 149. V. FUTURE TRENDS: FROM LABELING TO HUMAN COMPUTATION November 1, 2011 Crowdsourcing for Research and Engineering 149
  • 150. The Turing Test (Alan Turing, 1950) November 1, 2011 Crowdsourcing for Research and Engineering 150
  • 151. November 1, 2011 Crowdsourcing for Research and Engineering 151
  • 152. The Turing Test (Alan Turing, 1950) November 1, 2011 Crowdsourcing for Research and Engineering 152
  • 153. What is a Computer? November 1, 2011 Crowdsourcing for Research and Engineering 153
  • 154. • What was old is new • “Crowdsourcing: A New Branch of Computer Science” (March 29, 2011) • See also: M. Croarken (2003), Tabulating the heavens: computing the Nautical Almanac in 18th-century England; D. A. Grier, When Computers Were Human, Princeton University Press, 2005 November 1, 2011 Crowdsourcing for Research and Engineering 154
  • 155. Davis et al. (2010). The HPU. November 1, 2011 Crowdsourcing for Research and Engineering 155
  • 156. Remembering the Human in HPU • Not just turning a mechanical crank November 1, 2011 Crowdsourcing for Research and Engineering 156
  • 157. Human Computation Rebirth of people as ‘computists’; people do tasks computers cannot (do well) Stage 1: Detecting robots – CAPTCHA: Completely Automated Public Turing test to tell Computers and Humans Apart – No useful work produced; people just answer questions with known answers Stage 2: Labeling data (at scale) – E.g. ESP game, typical use of MTurk – Game changer for AI: starving for data Stage 3: General “human computation” (HPU) – people do arbitrarily sophisticated tasks (i.e. compute arbitrary functions) – HPU as core component in system architecture, many “HPC” invocations – blend HPU with automation for a new class of hybrid applications – New tradeoffs possible in latency/cost vs. functionality/accuracy November 1, 2011 Crowdsourcing for Research and Engineering 157
  • 158. Mobile Phone App: “Amazon Remembers” November 1, 2011 Crowdsourcing for Research and Engineering 158
  • 159. Soylent: A Word Processor with a Crowd Inside • Bernstein et al., UIST 2010 November 1, 2011 Crowdsourcing for Research and Engineering 159
  • 160. CrowdSearch and mCrowd • T. Yan, MobiSys 2010 November 1, 2011 Crowdsourcing for Research and Engineering 160
  • 161. Translation by monolingual speakers • C. Hu, CHI 2009 November 1, 2011 Crowdsourcing for Research and Engineering 161
  • 162. Wisdom of Crowds (WoC) Requires • Diversity • Independence • Decentralization • Aggregation Input: large, diverse sample (to increase likelihood of overall pool quality) Output: consensus or selection (aggregation) November 1, 2011 Crowdsourcing for Research and Engineering 162
  • 163. WoC vs. Ensemble Learning • Combine multiple models to improve performance over any constituent model – Can use many weak learners to make a strong one – Compensate for poor models with extra computation • Works better with diverse, independent learners • cf. NIPS 2010-2011 Workshops – Computational Social Science & the Wisdom of Crowds • More investigation needed of traditional feature-based machine learning & ensemble methods for consensus labeling with crowdsourcing November 1, 2011 Crowdsourcing for Research and Engineering 163
  • 164. Unreasonable Effectiveness of Data • Massive free Web data changed how we train learning systems – Banko and Brill (2001). Human Language Tech. – Halevy et al. (2009). IEEE Intelligent Systems. • How might access to cheap & plentiful labeled data change the balance again? November 1, 2011 Crowdsourcing for Research and Engineering 164
  • 165. Active Learning • Minimize number of labels to achieve goal accuracy rate of classifier – Select examples to label to maximize learning • Vijayanarasimhan and Grauman (CVPR 2011) – Simple margin criteria: select maximally uncertain examples to label next – Finding which examples are uncertain can be computationally intensive (workers have to wait) – Use locality-sensitive hashing to find uncertain examples in sub-linear time November 1, 2011 Crowdsourcing for Research and Engineering 165
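A sketch of the "simple margin" criterion described above: rank unlabeled examples by the gap between their top two predicted class probabilities and send the most uncertain ones to the crowd next. The `predict_proba` function is assumed to come from whatever classifier is being trained; the hashing trick Vijayanarasimhan and Grauman use to make this fast is not shown.

```python
# "Simple margin" selection: smallest gap between the top two predicted class
# probabilities = most uncertain example.

def most_uncertain(examples, predict_proba, batch_size=100):
    def margin(x):
        probs = sorted(predict_proba(x), reverse=True)
        return probs[0] - probs[1]        # small margin = high uncertainty
    return sorted(examples, key=margin)[:batch_size]
```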
  • 166. Active Learning (2) • V&G report each learning iteration ~ 75 min – 15 minutes for model training & selection – 60 minutes waiting for crowd labels • Leaving workers idle may lose them, slowing uptake and completion times • Keep workers occupied – Mason and Suri (2010): paid waiting room – Laws et al. (EMNLP 2011): parallelize labeling and example selection via producer-consumer model • Workers consume examples, produce labels • Model consumes label, produces examples November 1, 2011 Crowdsourcing for Research and Engineering 166
  • 167. MapReduce with human computation • Commonalities – Large task divided into smaller sub-problems – Work distributed among worker nodes (workers) – Collect all answers and combine them – Varying performance of heterogeneous CPUs/HPUs • Variations – Human response latency / size of “cluster” – Some tasks are not suitable November 1, 2011 Crowdsourcing for Research and Engineering 167
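A toy illustration of the parallel, with `crowd_do` standing in for whatever HPU call a real system would make; the point is only the split, distribute, and combine shape, not a real MapReduce runtime.

```python
# Split work into sub-problems, let (human or machine) workers handle each
# piece, then combine the answers. crowd_do is a placeholder for a real HPU call.

def crowd_map_reduce(items, crowd_do, combine):
    partial = [crowd_do(item) for item in items]   # in practice: parallel HITs
    return combine(partial)

# e.g. crowd_map_reduce(documents, crowd_do=collect_5_judgments, combine=merge_votes)
```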
  • 168. A Few Questions • How should we balance automation vs. human computation? Which does what? • Who’s the right person for the job? • How do we handle complex tasks? Can we decompose them into smaller tasks? How? November 1, 2011 Crowdsourcing for Research and Engineering 168
  • 169. Research problems – operational • Methodology – Budget, people, document, queries, presentation, incentives, etc. – Scheduling – Quality • What’s the best “mix” of HC for a task? • What are the tasks suitable for HC? • Can I crowdsource my task? – Eickhoff and de Vries, WSDM 2011 CSDM Workshop November 1, 2011 Crowdsourcing for Research and Engineering 169
  • 170. More problems • Human factors vs. outcomes • Editors vs. workers • Pricing tasks • Predicting worker quality from observable properties (e.g. task completion time) • HIT / Requestor ranking or recommendation • Expert search: who are the right workers given task nature and constraints • Ensemble methods for Crowd Wisdom consensus November 1, 2011 Crowdsourcing for Research and Engineering 170
  • 171. Problems: crowds, clouds and algorithms • Infrastructure – Current platforms are very rudimentary – No tools for data analysis • Dealing with uncertainty (propagate rather than mask) – Temporal and labeling uncertainty – Learning algorithms – Search evaluation – Active learning (which example is likely to be labeled correctly) • Combining CPU + HPU – Human Remote Call? – Procedural vs. declarative? – Integration points with enterprise systems November 1, 2011 Crowdsourcing for Research and Engineering 171
  • 172. CrowdForge: MapReduce for Automation + Human Computation Kittur et al., CHI 2011 November 1, 2011 Crowdsourcing for Research and Engineering 172
  • 173. Conclusions • Crowdsourcing works and is here to stay • Fast turnaround, easy to experiment, cheap • Still have to design the experiments carefully! • Usability considerations • Worker quality • User feedback extremely useful November 1, 2011 Crowdsourcing for Research and Engineering 173
  • 174. Conclusions - II • Lots of opportunities to improve current platforms • Integration with current systems • While MTurk first to-market in micro-task vertical, many other vendors are emerging with different affordances or value-added features • Many open research problems … November 1, 2011 Crowdsourcing for Research and Engineering 174
  • 175. Conclusions – III • Important to know your limitations and be ready to collaborate • Lots of different skills and expertise required – Social/behavioral science – Human factors – Algorithms – Economics – Distributed systems – Statistics November 1, 2011 Crowdsourcing for Research and Engineering 175
  • 176. VIII REFERENCES & RESOURCES November 1, 2011 Crowdsourcing for Research and Engineering 176
  • 177. Books • Omar Alonso, Gabriella Kazai, and Stefano Mizzaro. (2012). Crowdsourcing for Search Engine Evaluation: Why and How. • Law and von Ahn (2011). Human Computation November 1, 2011 Crowdsourcing for Research and Engineering 177
  • 178. More Books July 2010, kindle-only: “This book introduces you to the top crowdsourcing sites and outlines step by step with photos the exact process to get started as a requester on Amazon Mechanical Turk.“ November 1, 2011 Crowdsourcing for Research and Engineering 178
  • 179. 2011 Tutorials and Keynotes • By Omar Alonso and/or Matthew Lease – CLEF: Crowdsourcing for Information Retrieval Experimentation and Evaluation (Sep. 20, Omar only) – CrowdConf (Nov. 1, this is it!) – IJCNLP: Crowd Computing: Opportunities and Challenges (Nov. 10, Matt only) – WSDM: Crowdsourcing 101: Putting the WSDM of Crowds to Work for You (Feb. 9) – SIGIR: Crowdsourcing for Information Retrieval: Principles, Methods, and Applications (July 24) • AAAI: Human Computation: Core Research Questions and State of the Art – Edith Law and Luis von Ahn, August 7 • ASIS&T: How to Identify Ducks In Flight: A Crowdsourcing Approach to Biodiversity Research and Conservation – Steve Kelling, October 10, ebird • EC: Conducting Behavioral Research Using Amazon's Mechanical Turk – Winter Mason and Siddharth Suri, June 5 • HCIC: Quality Crowdsourcing for Human Computer Interaction Research – Ed Chi (June 14-18, about HCIC) – Also see his: Crowdsourcing for HCI Research with Amazon Mechanical Turk • Multimedia: Frontiers in Multimedia Search – Alan Hanjalic and Martha Larson, Nov 28 • VLDB: Crowdsourcing Applications and Platforms – Anhai Doan, Michael Franklin, Donald Kossmann, and Tim Kraska • WWW: Managing Crowdsourced Human Computation – Panos Ipeirotis and Praveen Paritosh November 1, 2011 Crowdsourcing for Research and Engineering 179
  • 180. 2011 Workshops & Conferences • AAAI-HCOMP: 3rd Human Computation Workshop (Aug. 8) • ACIS: Crowdsourcing, Value Co-Creation, & Digital Economy Innovation (Nov. 30 – Dec. 2) • Crowdsourcing Technologies for Language and Cognition Studies (July 27) • CHI-CHC: Crowdsourcing and Human Computation (May 8) • CIKM: BooksOnline (Oct. 24, “crowdsourcing … online books”) • CrowdConf 2011 -- 2nd Conf. on the Future of Distributed Work (Nov. 1-2) • Crowdsourcing: Improving … Scientific Data Through Social Networking (June 13) • EC: Workshop on Social Computing and User Generated Content (June 5) • ICWE: 2nd International Workshop on Enterprise Crowdsourcing (June 20) • Interspeech: Crowdsourcing for speech processing (August) • NIPS: Second Workshop on Computational Social Science and the Wisdom of Crowds (Dec. TBD) • SIGIR-CIR: Workshop on Crowdsourcing for Information Retrieval (July 28) • TREC-Crowd: Year 1 of TREC Crowdsourcing Track (Nov. 16-18) • UbiComp: 2nd Workshop on Ubiquitous Crowdsourcing (Sep. 18) • WSDM-CSDM: Crowdsourcing for Search and Data Mining (Feb. 9) November 1, 2011 Crowdsourcing for Research and Engineering 180
  • 181. Things to Come in 2012 • AAAI Symposium: Wisdom of the Crowd – March 26-28 • Year 2 of TREC Crowdsourcing Track • Human Computation workshop/conference (TBD) • Journal Special Issues – Springer’s Information Retrieval: Crowdsourcing for Information Retrieval – Hindawi’s Advances in Multimedia Journal: Multimedia Semantics Analysis via Crowdsourcing Geocontext – IEEE Internet Computing: Crowdsourcing (Sept./Oct. 2012) – IEEE Transactions on Multimedia: Crowdsourcing in Multimedia (proposal in review) November 1, 2011 Crowdsourcing for Research and Engineering 181
  • 182. Thank You! Crowdsourcing news & information: ir.ischool.utexas.edu/crowd For further questions, contact us at: omar.alonso@microsoft.com ml@ischool.utexas.edu Cartoons by Mateo Burtch (buta@sonic.net) November 1, 2011 Crowdsourcing for Research and Engineering 182
  • 183. Recent Overview Papers • Alex Quinn and Ben Bederson. Human Computation: A Survey and Taxonomy of a Growing Field. In Proceedings of CHI 2011. • Man-Ching Yuen, Irwin King, and Kwong-Sak Leung. A Survey of Crowdsourcing Systems. SocialCom 2011. • A. Doan, R. Ramakrishnan, A. Halevy. Crowdsourcing Systems on the World-Wide Web. Communications of the ACM, 2011. November 1, 2011 Crowdsourcing for Research and Engineering 183
  • 184. Resources
A Few Blogs
• Behind Enemy Lines (P.G. Ipeirotis, NYU)
• Deneme: a Mechanical Turk experiments blog (Greg Little, MIT)
• CrowdFlower Blog
• http://experimentalturk.wordpress.com
• Jeff Howe
A Few Sites
• The Crowdsortium
• Crowdsourcing.org
• CrowdsourceBase (for workers)
• Daily Crowdsource
MTurk Forums and Resources
• Turker Nation: http://turkers.proboards.com
• http://www.turkalert.com (and its blog)
• Turkopticon: report/avoid shady requestors
• Amazon Forum for MTurk
November 1, 2011 Crowdsourcing for Research and Engineering 184
Bibliography
• J. Barr and L. Cabrera. “AI Gets a Brain”, ACM Queue, May 2006.
• M. Bernstein et al. Soylent: A Word Processor with a Crowd Inside. UIST 2010. Best Student Paper award.
• B.B. Bederson, C. Hu, and P. Resnik. Translation by Iterative Collaboration between Monolingual Users. Proceedings of Graphics Interface (GI 2010), 39-46.
• N. Bradburn, S. Sudman, and B. Wansink. Asking Questions: The Definitive Guide to Questionnaire Design, Jossey-Bass, 2004.
• C. Callison-Burch. “Fast, Cheap, and Creative: Evaluating Translation Quality Using Amazon’s Mechanical Turk”, EMNLP 2009.
• P. Dai, Mausam, and D. Weld. “Decision-Theoretic Control of Crowd-Sourced Workflows”, AAAI 2010.
• J. Davis et al. “The HPU”, IEEE Computer Vision and Pattern Recognition Workshop on Advancing Computer Vision with Humans in the Loop (ACVHL), June 2010.
• M. Gashler, C. Giraud-Carrier, and T. Martinez. Decision Tree Ensemble: Small Heterogeneous Is Better Than Large Homogeneous, ICMLA 2008.
• D. A. Grier. When Computers Were Human. Princeton University Press, 2005. ISBN 0691091579.
• S. Hacker and L. von Ahn. “Matchin: Eliciting User Preferences with an Online Game”, CHI 2009.
• J. Heer and M. Bostock. “Crowdsourcing Graphical Perception: Using Mechanical Turk to Assess Visualization Design”, CHI 2010.
• P. Heymann and H. Garcia-Molina. “Human Processing”, Technical Report, Stanford InfoLab, 2010.
• J. Howe. “Crowdsourcing: Why the Power of the Crowd Is Driving the Future of Business”. Crown Business, New York, 2008.
• P. Hsueh, P. Melville, and V. Sindhwani. “Data Quality from Crowdsourcing: A Study of Annotation Selection Criteria”. NAACL HLT Workshop on Active Learning and NLP, 2009.
• B. Huberman, D. Romero, and F. Wu. “Crowdsourcing, attention and productivity”. Journal of Information Science, 2009.
• P.G. Ipeirotis. The New Demographics of Mechanical Turk. March 9, 2010. PDF and spreadsheet.
• P.G. Ipeirotis, R. Chandrasekar, and P. Bennett. Report on the Human Computation Workshop. SIGKDD Explorations, v. 11, no. 2, pp. 80-83, 2010.
• P.G. Ipeirotis. Analyzing the Amazon Mechanical Turk Marketplace. CeDER-10-04 (Sept. 11, 2010).

November 1, 2011   Crowdsourcing for Research and Engineering   185
Bibliography (2)
• A. Kittur, E. Chi, and B. Suh. “Crowdsourcing User Studies with Mechanical Turk”, SIGCHI 2008.
• Aniket Kittur, Boris Smus, and Robert E. Kraut. CrowdForge: Crowdsourcing Complex Work. CHI 2011.
• Adriana Kovashka and Matthew Lease. “Human and Machine Detection of … Similarity in Art”. CrowdConf 2010.
• K. Krippendorff. “Content Analysis”, Sage Publications, 2003.
• G. Little, L. Chilton, M. Goldman, and R. Miller. “TurKit: Tools for Iterative Tasks on Mechanical Turk”, HCOMP 2009.
• T. Malone, R. Laubacher, and C. Dellarocas. Harnessing Crowds: Mapping the Genome of Collective Intelligence. 2009.
• W. Mason and D. Watts. “Financial Incentives and the ‘Performance of Crowds’”, HCOMP Workshop at KDD 2009.
• J. Nielsen. “Usability Engineering”, Morgan Kaufmann, 1994.
• A. Quinn and B. Bederson. “A Taxonomy of Distributed Human Computation”, Technical Report HCIL-2009-23, 2009.
• J. Ross, L. Irani, M. Six Silberman, A. Zaldivar, and B. Tomlinson. “Who are the Crowdworkers?: Shifting Demographics in Amazon Mechanical Turk”. CHI 2010.
• F. Scheuren. “What is a Survey” (http://www.whatisasurvey.info), 2004.
• R. Snow, B. O’Connor, D. Jurafsky, and A. Y. Ng. “Cheap and Fast – But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks”. EMNLP 2008.
• V. Sheng, F. Provost, and P. Ipeirotis. “Get Another Label? Improving Data Quality … Using Multiple, Noisy Labelers”. KDD 2008.
• S. Weber. “The Success of Open Source”, Harvard University Press, 2004.
• L. von Ahn. Games with a Purpose. Computer, 39 (6), 92–94, 2006.
• L. von Ahn and L. Dabbish. “Designing Games with a Purpose”. CACM, Vol. 51, No. 8, 2008.

November 1, 2011   Crowdsourcing for Research and Engineering   186
Bibliography (3)
• Shuo Chen et al. What if the Irresponsible Teachers Are Dominating? A Method of Training on Samples and Clustering on Teachers. AAAI 2010.
• Paul Heymann and Hector Garcia-Molina. Turkalytics: Analytics for Human Computation. WWW 2011.
• Florian Laws, Christian Scheible, and Hinrich Schütze. Active Learning with Amazon Mechanical Turk. EMNLP 2011.
• C.Y. Lin. ROUGE: A Package for Automatic Evaluation of Summaries. Proceedings of the Workshop on Text Summarization Branches Out (WAS), 2004.
• C. Marshall and F. Shipman. “The Ownership and Reuse of Visual Media”, JCDL 2011.
• Hohyon Ryu and Matthew Lease. Crowdworker Filtering with Support Vector Machine. ASIS&T 2011.
• Wei Tang and Matthew Lease. Semi-Supervised Consensus Labeling for Crowdsourcing. ACM SIGIR Workshop on Crowdsourcing for Information Retrieval (CIR), 2011.
• S. Vijayanarasimhan and K. Grauman. Large-Scale Live Active Learning: Training Object Detectors with Crawled Data and Crowds. CVPR 2011.
• Stephen Wolfson and Matthew Lease. Look Before You Leap: Legal Pitfalls of Crowdsourcing. ASIS&T 2011.

November 1, 2011   Crowdsourcing for Research and Engineering   187
Crowdsourcing in IR: 2008-2010
• 2008
     – O. Alonso, D. Rose, and B. Stewart. “Crowdsourcing for relevance evaluation”, SIGIR Forum, Vol. 42, No. 2.
• 2009
     – O. Alonso and S. Mizzaro. “Can we get rid of TREC Assessors? Using Mechanical Turk for … Assessment”. SIGIR Workshop on the Future of IR Evaluation.
     – P.N. Bennett, D.M. Chickering, and A. Mityagin. Learning Consensus Opinion: Mining Data from a Labeling Game. WWW.
     – G. Kazai, N. Milic-Frayling, and J. Costello. “Towards Methods for the Collective Gathering and Quality Control of Relevance Assessments”, SIGIR.
     – G. Kazai and N. Milic-Frayling. “… Quality of Relevance Assessments Collected through Crowdsourcing”. SIGIR Workshop on the Future of IR Evaluation.
     – Law et al. “SearchWar”. HCOMP.
     – H. Ma, R. Chandrasekar, C. Quirk, and A. Gupta. “Improving Search Engines Using Human Computation Games”, CIKM 2009.
• 2010
     – SIGIR Workshop on Crowdsourcing for Search Evaluation.
     – O. Alonso, R. Schenkel, and M. Theobald. “Crowdsourcing Assessments for XML Ranked Retrieval”, ECIR.
     – K. Berberich, S. Bedathur, O. Alonso, and G. Weikum. “A Language Modeling Approach for Temporal Information Needs”, ECIR.
     – C. Grady and M. Lease. “Crowdsourcing Document Relevance Assessment with Mechanical Turk”. NAACL HLT Workshop on … Amazon's Mechanical Turk.
     – Grace Hui Yang, Anton Mityagin, Krysta M. Svore, and Sergey Markov. “Collecting High Quality Overlapping Labels at Low Cost”. SIGIR.
     – G. Kazai. “An Exploration of the Influence that Task Parameters Have on the Performance of Crowds”. CrowdConf.
     – G. Kazai. “… Crowdsourcing in Building an Evaluation Platform for Searching Collections of Digitized Books”. Workshop on Very Large Digital Libraries (VLDL).
     – Stephanie Nowak and Stefan Rüger. How Reliable are Annotations via Crowdsourcing? MIR.
     – Jean-François Paiement, James G. Shanahan, and Remi Zajac. “Crowdsourcing Local Search Relevance”. CrowdConf.
     – Maria Stone and Omar Alonso. “A Comparison of On-Demand Workforce with Trained Judges for Web Search Relevance Evaluation”. CrowdConf.
     – T. Yan, V. Kumar, and D. Ganesan. CrowdSearch: Exploiting Crowds for Accurate Real-Time Image Search on Mobile Phones. MobiSys, pp. 77-90, 2010.

November 1, 2011   Crowdsourcing for Research and Engineering   188
Crowdsourcing in IR: 2011
• WSDM Workshop on Crowdsourcing for Search and Data Mining.
• SIGIR Workshop on Crowdsourcing for Information Retrieval.
• O. Alonso and R. Baeza-Yates. “Design and Implementation of Relevance Assessments using Crowdsourcing”, ECIR 2011.
• Roi Blanco, Harry Halpin, Daniel Herzig, Peter Mika, Jeffrey Pound, Henry Thompson, and Thanh D. Tran. “Repeatable and Reliable Search System Evaluation using Crowd-Sourcing”. SIGIR 2011.
• Yen-Ta Huang, An-Jung Cheng, Liang-Chi Hsieh, Winston H. Hsu, and Kuo-Wei Chang. “Region-Based Landmark Discovery by Crowdsourcing Geo-Referenced Photos.” SIGIR 2011.
• Hyun Joon Jung and Matthew Lease. “Improving Consensus Accuracy via Z-score and Weighted Voting”. HCOMP 2011.
• G. Kasneci, J. Van Gael, D. Stern, and T. Graepel. CoBayes: Bayesian Knowledge Corroboration with Assessors of Unknown Areas of Expertise. WSDM 2011.
• Gabriella Kazai. “In Search of Quality in Crowdsourcing for Search Engine Evaluation”, ECIR 2011.
• Gabriella Kazai, Jaap Kamps, Marijn Koolen, and Natasa Milic-Frayling. “Crowdsourcing for Book Search Evaluation: Impact of Quality on Comparative System Ranking.” SIGIR 2011.
• Abhimanu Kumar and Matthew Lease. “Learning to Rank From a Noisy Crowd”. SIGIR 2011.
• Edith Law, Paul N. Bennett, and Eric Horvitz. “The Effects of Choice in Routing Relevance Judgments”. SIGIR 2011.

November 1, 2011   Crowdsourcing for Research and Engineering   189