Crowdsourcing big data_industry_jun-25-2015_for_slideshare

FCPCCS - Big Data and Crowdsourcing
Pattern-recognition and the crowd

What would you do with unlimited human analysts?

People
Data Categories

Models

Unstructured data gets structured (bonus: a system that gets smarter over time)
[Diagram labels: Adaptive System; Machine Learning; Optimization; Human Annotation; Prediction Engine; Structured Data Reports; Action]

Efficiency of human time is a major benefit
[Chart: Finding Relevant News Articles. Series: % analyst time saved; % accuracy (compared to humans). Categories: News Category 4, News Category 2, News Category 1, Manufacturing, Health Sciences. Axis: 0% to 100%. Values in extraction order, not matched to categories: 80%, 85%, 99%, 83%, 81%, 88%, 87%, 90%, 73%, 91%]

Wait a sec! Aren’t these ducks?
(Can we agree to disagree?)

The importance of definition
• If people can’t agree on what’s-in and what’s-out, it’s hard to train a machine
• In our case toxicity was defined as:
  • ad hominem attacks (directed at specific people)
  • bigoted comments (e.g., sexist, racist, homophobic)
• Set definitions
• Then see if people are consistent
• Run pilots
• Measure inter-annotator agreement
• Iterate

Inter-annotator agreement: is everyone measuring the same way?

Quick recommendation for inter-annotator agreement
• You can measure consistency; probably the best way is Krippendorff’s alpha
• Don’t use percentage agreement, particularly when data are skewed towards one category!
• If 95% of the data fall under one category label, two people coding at random in those proportions would still agree about 90% of the time (0.95 × 0.95 + 0.05 × 0.05 = 0.905), so % agreement would make you think you had a reliable study (even though you wouldn’t)
• And you can ALSO use models to check these things
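
To make the warning about percentage agreement concrete, here is a minimal sketch in Python. It assumes nominal (unordered) labels, and the two annotators and their labels are made-up illustration data, not figures from this talk. The function names are hypothetical helpers written for this example.

```python
from collections import Counter
from itertools import permutations


def percent_agreement(a, b):
    """Share of items on which two annotators chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` holds one list of labels per item (one label per annotator
    who coded that item). Items with fewer than two labels are skipped.
    """
    # Coincidence matrix: every ordered pair of labels given to the same item.
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        for c, k in permutations(labels, 2):
            coincidences[(c, k)] += 1 / (m - 1)

    totals = Counter()
    for (c, _), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())

    # Observed vs. chance-expected disagreement (both scaled by n).
    observed = sum(w for (c, k), w in coincidences.items() if c != k)
    expected = sum(
        totals[c] * totals[k] for c in totals for k in totals if c != k
    ) / (n - 1)
    if expected == 0:
        return 1.0  # alpha is undefined with a single label; 1.0 by convention here
    return 1 - observed / expected


# Made-up data: 100 comments, each annotator marks 5% as "toxic",
# but never the same comments.
annotator_a = ["toxic"] * 5 + ["ok"] * 95
annotator_b = ["ok"] * 5 + ["toxic"] * 5 + ["ok"] * 90

agreement = percent_agreement(annotator_a, annotator_b)
alpha = krippendorff_alpha_nominal(
    [list(pair) for pair in zip(annotator_a, annotator_b)]
)
print(f"percent agreement: {agreement:.2f}")
print(f"Krippendorff's alpha: {alpha:.3f}")
```

In this toy case the annotators agree on 90% of items yet never agree on a single toxic comment, and alpha comes out close to zero, which is the situation the bullets above describe.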

Finding healthy communities (supportive)

And unhealthy ones (toxic)

Collect data and annotations, then interrogate it
• Human annotations
• Which people/categories should we be wary of?
• Which annotations do we select to train a model with?
• A classifier that can predict unseen data

Routing messages that matter

Processing millions of SMS in 12 African languages
• Intent of sender (i.e., report a problem, ask a question or make a suggestion)
• Categorization (e.g., orphans and vulnerable children, violence against children, health, nutrition)
• Language detection (e.g., English, Acholi, Karamojong, Luganda, Nkole, Swahili, Lango)
• Location (i.e., village names)

1.4%

Top 3 categories in Nigeria
[Chart: categories Employment, U-report support, Health. Values in extraction order, not matched to categories: 9.69%, 17.68%, 39.44%]

The Donald Rumsfeld Question

How do I find what I don’t know I don’t know?

Negative topics in Walmart employee reviews
[Chart: topics Hours/Benefits, Management, Work/life balance, Company Values, Dealing With Customers, Training & Expectation, Low Pay. Counts in extraction order, not reliably matched to topics: 968, 518, 2,404, 1,241, 658, 968, 1,446]

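Structuring unstructured data lets you combine it with other metadata
[Charts: Common Pros among Employees and Common Cons among Employees, each comparing Current and Former employees by percentage. Category labels did not survive extraction. Values in extraction order: 37%, 25%, 24%, 41%, 27%, 17% (axis 0% to 50%); 24%, 16%, 13%, 13%, 14%, 16%, 12% (axis 0% to 30%)]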

Question: What improves models the most?

Instead of worrying about the algorithms in the machine

It’s almost always better to just get more pandas

How else do you verify?
• We assess model accuracy using cross-validation.
• Instead of using all annotated data to train a model, you hold out a random 10% and build the model with the rest.
• Then you predict against that 10%. You do this 10 times, with a different 10% held out each time, and average the accuracy.
• Precision measures “if we automatically label something as X, how often are we right?”
• Recall measures “how much of the stuff that SHOULD have label X is actually given label X?”
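
The verification steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the system described in the talk: the threshold "model", the numeric toy data, and all function names are hypothetical stand-ins chosen so the example runs with the standard library alone.

```python
import random


def precision_recall(gold, predicted, label):
    """Precision and recall for one label, from paired gold/predicted lists."""
    pairs = list(zip(gold, predicted))
    true_pos = sum(g == label and p == label for g, p in pairs)
    false_pos = sum(g != label and p == label for g, p in pairs)
    false_neg = sum(g == label and p != label for g, p in pairs)
    # Precision: of everything we labeled X, how much really is X?
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    # Recall: of everything that really is X, how much did we label X?
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


def cross_validate(examples, train, predict, k=10, seed=0):
    """Average held-out accuracy over k folds of (features, label) pairs."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k disjoint slices
    accuracies = []
    for i, held_out in enumerate(folds):
        # Train on every fold except the one being held out.
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train(training)
        correct = sum(predict(model, x) == y for x, y in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k


# Toy stand-in for a real model: a threshold halfway between the class means.
def train_threshold(training):
    xs = [x for x, y in training if y == "X"]
    others = [x for x, y in training if y != "X"]
    return (sum(xs) / len(xs) + sum(others) / len(others)) / 2


def predict_threshold(threshold, x):
    return "X" if x >= threshold else "other"


# Made-up data: 100 numbers, labeled "X" when 50 or above.
examples = [(x, "X" if x >= 50 else "other") for x in range(100)]
mean_accuracy = cross_validate(examples, train_threshold, predict_threshold)

gold = ["X", "X", "X", "other", "other"]
predicted = ["X", "X", "other", "X", "other"]
precision, recall = precision_recall(gold, predicted, "X")
print(f"mean held-out accuracy: {mean_accuracy:.2f}")
print(f"precision: {precision:.2f}, recall: {recall:.2f}")
```

In the small precision/recall example, two of the three items labeled X are correct and two of the three true X items are found, so both measures are 2/3. Any real classifier can be swapped in by passing different `train` and `predict` functions.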

The system gets smarter
• Here’s what happens across the first 2,543 annotations on one REALLY low-signal classification task
• By 9,744 annotations, our accuracy is 97%

Other tasks are more straightforward
F-scores go up with more annotations
[Chart: F-score (0.00 to 1.00) against number of paragraphs annotated (50 to 200). Series: Disease, Country, Reported_deaths, Reported_cases, Date. Field labels: Issue, Location, People affected, # of deaths, Event date]

Project workflow
Phase 1: Data
• Data capture, normalization and loading
Phase 2: Discovery
• Topic discovery
• Category creation
• Expert data annotation
• Category verification
Phase 3: Training
• Guideline creation
• Annotator validation
• Model training
Phase 4: Optimization
• Model evaluation
• Category refinement
Phase 5: Model Deployment
• Full system integration
• Model performance
• Metrics reporting

email tyler@idibon.com
twitter @idibon
www idibon.com
THANK YOU!
