A presentation to government officials doing crowdsourcing and citizen science. What can machine learning techniques and industry use cases do to help get the most out of data (and big data)?
8. FCPCCS - Big Data and Crowdsourcing
Unstructured data gets structured (bonus: a system that gets smarter over time)
Adaptive System: Machine Learning, Optimization, Human Annotation, Prediction Engine
Outputs: Structured Data Reports, Action
9. FCPCCS - Big Data and Crowdsourcing
Finding Relevant News Articles
[Bar chart, axis 0% to 100%. Series: % analyst time saved; % accuracy (compared to humans). Categories: News Category 4, News Category 2, News Category 1, Manufacturing, Health Sciences. Bar values as extracted: 80%, 85%, 99%, 83%, 81%, 88%, 87%, 90%, 73%, 91%.]
Efficiency of human time is a major benefit
14. FCPCCS - Big Data and Crowdsourcing
Wait a sec! Aren’t these ducks?
(Can we agree to disagree?)
15. FCPCCS - Big Data and Crowdsourcing
The importance of definition
• If people can’t agree on what’s in and what’s out, it’s hard to train a machine
• In our case toxicity was defined as:
  • ad hominem attacks (directed at specific people)
  • bigoted comments (e.g., sexist, racist, homophobic)
• Set definitions
• Then see if people are consistent
• Run pilots
• Do inter-annotator agreement
• Iterate
16. FCPCCS - Big Data and Crowdsourcing
Inter-annotator agreement: is everyone
measuring the same way?
17. FCPCCS - Big Data and Crowdsourcing
Quick recommendation for inter-annotator
agreement
• You can measure consistency; probably the best way is Krippendorff’s alpha
• Don’t use percentage agreement! Particularly when data are skewed towards one category.
• If 95% of the data fall under one category label, then even random coding would have two people agree so often that % agreement would make you think you had a reliable study (even though you wouldn’t)
• And you can ALSO use models to check these things
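To make the warning about percentage agreement concrete, here is a minimal sketch comparing it with Krippendorff’s alpha on skewed data. The labels are made up for illustration, the function names are hypothetical, and the alpha implementation covers only the nominal (unordered category) case.

```python
from collections import Counter
from itertools import permutations


def percent_agreement(coder_a, coder_b):
    """Share of items on which two coders chose the same label."""
    return sum(a == b for a, b in zip(coder_a, coder_b)) / len(coder_a)


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of items; each item is the list of labels it received.
    Items with fewer than two labels carry no agreement information.
    """
    # Coincidence matrix: every ordered pair of labels given to the same item.
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        for c, k in permutations(labels, 2):
            coincidences[(c, k)] += 1 / (m - 1)

    totals = Counter()
    for (c, _), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())

    observed = sum(w for (c, k), w in coincidences.items() if c != k)
    expected = sum(
        totals[c] * totals[k] for c in totals for k in totals if c != k
    ) / (n - 1)
    if expected == 0:
        return float("nan")  # only one label ever used: alpha is undefined
    return 1 - observed / expected


# Made-up data: 100 comments, each coder marks 5 as "toxic", never the same 5.
coder_a = ["toxic"] * 5 + ["ok"] * 95
coder_b = ["ok"] * 5 + ["toxic"] * 5 + ["ok"] * 90

agreement = percent_agreement(coder_a, coder_b)
alpha = krippendorff_alpha_nominal([list(pair) for pair in zip(coder_a, coder_b)])
print(f"percent agreement: {agreement:.2f}")
print(f"Krippendorff's alpha: {alpha:.3f}")
```

In this made-up case the two coders agree on 90% of items, yet they never agree on a single toxic comment, and alpha comes out slightly below zero (no better than chance).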
18. FCPCCS - Big Data and Crowdsourcing
Finding healthy communities (supportive)
19. FCPCCS - Big Data and Crowdsourcing
And unhealthy ones (toxic)
21. FCPCCS - Big Data and Crowdsourcing
Collect data and annotations—then interrogate it
Human annotations
Which people/categories should we be wary of?
Which annotations do we select to train a model with?
A classifier that can predict unseen data
22. FCPCCS - Big Data and Crowdsourcing
Routing messages that matter
23. FCPCCS - Big Data and Crowdsourcing
Processing millions of SMS in 12 African languages
Intent of sender (i.e. report a problem, ask a question or make a suggestion)
Categorization (e.g. orphans and vulnerable children, violence against children, health, nutrition)
Language detection (e.g. English, Acholi, Karamojong, Luganda, Nkole, Swahili, Lango)
Location (e.g. village names)
27. FCPCCS - Big Data and Crowdsourcing
Top 3 categories in Nigeria
[Bar chart. Categories: Employment, U-report support, Health. Values as extracted: 9.69%, 17.68%, 39.44%.]
28. FCPCCS - Big Data and Crowdsourcing
The Donald Rumsfeld Question
29. FCPCCS - Big Data and Crowdsourcing
How do I find what I don’t know I don’t
know?
30. FCPCCS - Big Data and Crowdsourcing
Negative topics in Walmart employee reviews
[Chart of topic sizes. Topics: Hours/Benefits, Management, Work/life balance, Company Values, Dealing With Customers, Training & Expectation, Low Pay. Counts as extracted: 968, 518, 2,404, 1,241, 658, 968, 1,446.]
31. FCPCCS - Big Data and Crowdsourcing
[Two bar charts comparing Current and Former employees, y-axes in percent. Panel 1: Common Pros among Employees (bar values as extracted: 37%, 25%, 24%, 41%, 27%, 17%). Panel 2: Common Cons among Employees (bar values as extracted: 24%, 16%, 13%, 13%, 14%, 16%, 12%).]
Structuring unstructured data lets you combine it
with other metadata
32. FCPCCS - Big Data and Crowdsourcing
Question: What improves models the
most?
33. FCPCCS - Big Data and Crowdsourcing
Instead of worrying about the algorithms
in the machine
34. FCPCCS - Big Data and Crowdsourcing
It’s almost always better to just get more
pandas
35. FCPCCS - Big Data and Crowdsourcing
How else do you verify?
We assess model accuracy using cross-validation.
Instead of using all annotated data to train a model, you hold out a
random 10% and build the model with the rest.
Then you predict against that 10%. You do this 10 times and average
the accuracy.
Precision measures “if we automatically label something as
X, how often are we right?”
Recall measures “how much of the stuff that SHOULD have label
X is actually given label X?”
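The procedure above can be sketched in a few lines. This is only an illustration under stated assumptions: the data are randomly generated, the "model" is a toy threshold on a single number (`train_threshold` is a hypothetical stand-in for a real classifier), and the 10 held-out sets are taken as 10 disjoint folds, which is one common reading of "do this 10 times".

```python
import random


def precision_recall(y_true, y_pred, label):
    """Precision and recall for one label."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)  # everything we labeled X
    actual = sum(t == label for t in y_true)     # everything that should be X
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall


def train_threshold(xs, ys):
    """Toy 'model': midpoint between the mean feature value of each class."""
    pos = [x for x, y in zip(xs, ys) if y == "X"]
    neg = [x for x, y in zip(xs, ys) if y != "X"]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def cross_validate(xs, ys, folds=10, seed=0):
    """Hold out each fold in turn, train on the rest, average the scores."""
    order = list(range(len(xs)))
    random.Random(seed).shuffle(order)
    scores = []
    for f in range(folds):
        held_out = set(order[f::folds])
        train = [i for i in order if i not in held_out]
        threshold = train_threshold([xs[i] for i in train], [ys[i] for i in train])
        test = sorted(held_out)
        preds = ["X" if xs[i] > threshold else "other" for i in test]
        truth = [ys[i] for i in test]
        accuracy = sum(p == t for p, t in zip(preds, truth)) / len(test)
        scores.append((accuracy, *precision_recall(truth, preds, "X")))
    return [sum(column) / folds for column in zip(*scores)]


# Made-up annotated data: 60 items labeled "X", 140 labeled "other".
rng = random.Random(42)
xs = [rng.gauss(2, 1) for _ in range(60)] + [rng.gauss(0, 1) for _ in range(140)]
ys = ["X"] * 60 + ["other"] * 140

accuracy, precision, recall = cross_validate(xs, ys)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Reporting precision and recall alongside accuracy matters most when one label is rare, because accuracy alone can look high even if the rare label is mostly missed.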
36. FCPCCS - Big Data and Crowdsourcing
The system gets smarter
Here’s what happens going across the first 2,543
annotations on one REALLY low signal classification task
By 9,744 annotations, our accuracy is 97%
37. FCPCCS - Big Data and Crowdsourcing
Other tasks are more straightforward
F-scores go up with more annotations
[Line chart. Y-axis: F-score (0.00 to 1.00). X-axis: Number of paragraphs annotated (50 to 200). Series: Disease, Country, Reported_deaths, Reported_cases, Date, Issue, Location, People affected, # of deaths, Event date.]
38. FCPCCS - Big Data and Crowdsourcing
Project workflow
Phase 1: Data
• Data capture, normalization and loading
Phase 2: Discovery
• Topic discovery
• Category creation
• Expert data annotation
• Category verification
Phase 3: Training
• Guideline creation
• Annotator validation
• Model training
Phase 4: Optimization
• Model evaluation
• Category refinement
Phase 5: Model Deployment
• Full system integration
• Model performance
• Metrics reporting
39. FCPCCS - Big Data and Crowdsourcing
email tyler@idibon.com
twitter @idibon
www idibon.com
THANK YOU!
This is the basic stuff you want. (It’s a little self-serving because Idibon’s adaptive system is what makes us special but we really do believe that optimizing training on relevant data with meaningful categories is THE way to deliver business value.)
By using computers to create an initial understanding of data and elevate specific cases for Human Annotation, we use computers to make human decisions smarter, and humans to make computer decisions smarter. Our system optimizes work by using cutting-edge Machine Learning that improves accuracy and learns iteratively. Our Prediction Engine provides initial conclusions for further evaluation by human analysts and is also what allows us to scale to tens of millions of messages a day. Our Optimization process teaches our algorithm what results to select for, essentially refining its accuracy. The key takeaway here is that we optimize for human analysts’ time: we can cluster data initially and automatically, then we can escalate specific cases to human annotation. Much of the learning is unsupervised and therefore faster, cheaper and actually more accurate.
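One common way to escalate specific cases to human annotation is to send low-confidence predictions to people and accept the rest automatically. The sketch below assumes that pattern; it is not a description of Idibon’s actual system, and the `Prediction` type, the `triage` function, the 0.8 threshold and the example messages are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    text: str
    label: str
    confidence: float  # model's probability for `label`, between 0 and 1


def triage(predictions, threshold=0.8):
    """Split predictions into auto-accepted ones and ones for human review."""
    automatic, escalated = [], []
    for p in predictions:
        (automatic if p.confidence >= threshold else escalated).append(p)
    return automatic, escalated


# Made-up predictions for illustration.
predictions = [
    Prediction("The clinic has no medicine", "health", 0.97),
    Prediction("When is the next meeting?", "question", 0.55),
    Prediction("There are no jobs here", "employment", 0.81),
]

automatic, escalated = triage(predictions)
print(len(automatic), "accepted automatically;", len(escalated), "sent to annotators")
```

The human labels collected for the escalated cases can then be added to the training data, which is one way a system like this gets smarter over time.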
After iterations in our adaptive system, previously unstructured data is now structured.
This structured data can be delivered in different outputs, including CSV file exports for your analysts to build reports or direct routing to customer service agents to take action.
As you can see—different categories have different results.
News category 1 is awesome—you really don’t have to show human analysts much data to get all the Relevant stuff (you show them 10% of the data and still get 99% of what the client cares about)
Manufacturing is less awesome. You can reduce your workload by 73%…but you have to accept that you’ll only get 83% of the stuff you care about (you’ll miss 17%). If you want to get more like 90% accuracy, you need to review more documents. You “only” get a workload reduction of ~56%.
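The trade-off between workload and coverage can be sketched as follows, assuming analysts review only the documents the model scores highest and that the slide’s “% accuracy” means the share of relevant documents that still reach them (recall). The scores, relevance flags and function name are made up for illustration.

```python
def review_tradeoff(scores, relevant, review_fraction):
    """Time saved and recall if analysts read only the top-scored documents."""
    ranked = sorted(zip(scores, relevant), key=lambda pair: pair[0], reverse=True)
    cutoff = round(len(ranked) * review_fraction)
    found = sum(is_relevant for _, is_relevant in ranked[:cutoff])
    total_relevant = sum(relevant)
    return {
        "time_saved": 1 - cutoff / len(ranked),
        "recall": found / total_relevant if total_relevant else 0.0,
    }


# Made-up model scores and ground truth for 10 documents.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
relevant = [True, True, False, True, False, False, False, False, True, False]

for fraction in (0.3, 0.5):
    print(fraction, review_tradeoff(scores, relevant, fraction))
```

Reviewing more of the ranked list raises recall and lowers the time saved, which is the same trade-off described for the Manufacturing category.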
Ideally, you want a system that gets better over time.
First case study!
http://idibon.com/toxicity-in-reddit-communities-a-journey-to-the-darkest-depths-of-the-interwebs/
Lately, Reddit has gotten a lot of press for having terrible, awful communities
See also http://cswww.essex.ac.uk/Research/nle/arrau/icagr.pdf
The important thing is having definitions people will agree with and can be consistent with…and which actually answer organizational objectives. Do you care about whether duck decoys and/or rubber duckies are ducks or not? WHY?
http://blog.ioactive.com/2013/05/security-101-machine-learning-and-big.html
The trickiest thing about ad hominem attacks as a definition is: what to do with trash talk in sports/gaming. Tricky!
This is interactive, check out: http://idibon.com/toxicity-in-reddit-communities-a-journey-to-the-darkest-depths-of-the-interwebs/
The DIY (do it yourself) group is the one that is most supportive and least toxic. This data ties to actual upvote/downvote behavior, meaning that you’re not actually a supportive community if everyone downvotes the supportive comments, nor are you a toxic community if everyone downvotes the toxic comments.
It’s only when everyone upvotes toxic comments that you are a toxic community by our definition here.
We also specifically looked at bigotry.
Indeed, /r/TheRedPill is seen as the most bigoted. It’s a subreddit dedicated to proud male chauvinism.
Case study three:
http://idibon.com/idibon-supports-unicef-provide-natural-language-processing-sms-based-social-monitoring-systems-africa/
Photo: http://unicefaids.tumblr.com/post/37835112363/photo-young-people-in-kitwe-zambia-explore-the
The United Nations Children’s Fund (UNICEF) is a United Nations branch that provides long-term humanitarian and developmental assistance to children and mothers in developing countries. Idibon provides scalable natural language processing and analytics to UNICEF’s multinational U-report applications, enabling UNICEF to process text messages sent from citizens in Uganda and Nigeria “to better understand and empower marginalized communities that are often excluded due to language barriers.” (Evan Wheeler, CTO of UNICEF’s Global Innovation Centre)
UNICEF U-report has only six dedicated analysts to process and respond to millions of messages a month, and Idibon’s technology enables the organization to operate efficiently and at scale.
Specifically, Idibon processes each SMS in four ways:
Intent of sender – to prioritize support/services (UNICEF receives more than a million messages a month and can only respond to about a thousand)
Categorization – to prioritize support/services and to route to appropriate analyst
Language detection – to route to appropriate analyst
Location – to identify where to send support/services
Press release: http://unicefstories.org/2015/02/09/idibon-supports-unicef-to-provide-natural-language-processing-to-sms-based-social-monitoring-systems-in-africa/
Environment is an important issue.
But it looks to be about 1.4% of the data…which means you do have to get enough data to build a model. Note that different countries/languages talk about the environment differently (Uganda: droughts, cows; Nigeria: oil). So you may have more or less heterogeneity in your rarer categories.
Image from http://www.theatlantic.com/photo/2011/06/nigeria-the-cost-of-oil/100082/
For more recent news: http://www.theguardian.com/environment/2015/jan/07/niger-delta-communities-to-sue-shell-in-london-for-oil-spill-compensation
“Environment” is clearly an important issue in Nigeria but only 1.4% of the messages are classified that way.
(One other thing: high/low percentages don’t necessarily correspond to personal or societal importance.)
Each needle found makes the next one easier to find, buuuuuuut some things you want to find are just too rare. You can’t model things that aren’t in the data.
At UNICEF, different people care about different categories: the people who respond to rumors of Ebola outbreaks or cures are different than the people trying to keep track of economic issues.
Most actionable is, of course, finding people who specifically require support about participating in the community.
Pay and Opportunities are much less of a pro once employees have left Walmart and become more of a con.
Management is highly criticised amongst both current and former employees.
9,744 annotations total
951 for engageable
8,793 for irrelevant
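These counts describe a skewed task, so it may help to work out what a trivial baseline would score. The arithmetic below uses only the three numbers above; the "always predict irrelevant" baseline is a hypothetical comparison point, not something reported in the slides.

```python
total, engageable, irrelevant = 9_744, 951, 8_793
assert engageable + irrelevant == total

share_engageable = engageable / total
# A "model" that always answers "irrelevant" is right on every irrelevant item
# and finds none of the engageable ones.
baseline_accuracy = irrelevant / total
baseline_recall_engageable = 0 / engageable

print(f"engageable share: {share_engageable:.1%}")
print(f"always-'irrelevant' accuracy: {baseline_accuracy:.1%}")
print(f"always-'irrelevant' recall on engageable: {baseline_recall_engageable:.0%}")
```

With roughly 10% of annotations marked engageable, always answering "irrelevant" would already be about 90% accurate, so an accuracy figure on this task is easiest to interpret next to precision and recall for the engageable label.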