
Crowdsourcing big data_industry_jun-25-2015_for_slideshare

A presentation to government officials working on crowdsourcing and citizen science: what machine learning techniques and industry use cases can do to help get the most out of data (and big data).


  1. FCPCCS - Big Data and Crowdsourcing: Pattern-recognition and the crowd
  2. What would you do with unlimited human analysts?
  3. (image-only slide)
  4. People, Data, Categories
  5. Models
  6. (image-only slide)
  7. (image-only slide)
  8. Unstructured data gets structured (bonus: a system that gets smarter over time). [Diagram of the adaptive system: human annotation plus machine learning and optimization feed a prediction engine, turning unstructured data into structured data, reports, and action.]
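
Read as a loop, the diagram suggests something like the sketch below: humans annotate, a model trains on those annotations, and low-confidence predictions are routed back for more annotation. This is a schematic reading of the slide, not Idibon’s actual code; the `train`/`triage` helpers, scikit-learn, and the 0.8 confidence threshold are all illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train(annotated):
    """Fit a classifier on (text, label) pairs from human annotation."""
    texts, labels = zip(*annotated)
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(texts, labels)
    return model

def triage(model, texts, threshold=0.8):
    """Keep confident predictions as structured data; route the rest to humans."""
    structured, needs_human = [], []
    for text, probs in zip(texts, model.predict_proba(texts)):
        if probs.max() >= threshold:
            structured.append((text, model.classes_[probs.argmax()]))
        else:
            # Fresh human annotations on these items retrain the model,
            # which is what makes the system smarter over time
            needs_human.append(text)
    return structured, needs_human
```
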
  9. Efficiency of human time is a major benefit. [Bar chart, “Finding Relevant News Articles”: % analyst time saved vs. % accuracy (compared to humans) across five tasks (News Categories 1, 2, and 4, Manufacturing, Health Sciences), with values ranging from 73% to 99%.]
  10. (image-only slide)
  11. (image-only slide)
  12. The importance of definition • If people can’t agree on what’s-in and what’s-out, it’s hard to train a machine
  13. (image-only slide)
  14. Wait a sec! Aren’t these ducks? (Can we agree to disagree?)
  15. The importance of definition
      • If people can’t agree on what’s-in and what’s-out, it’s hard to train a machine
      • In our case, toxicity was defined as:
        • ad hominem attacks (directed at specific people)
        • bigoted comments (e.g., sexist, racist, or homophobic remarks)
      • Set definitions
      • Then see if people are consistent
      • Run pilots
      • Do inter-annotator agreement
      • Iterate
  16. Inter-annotator agreement: is everyone measuring the same way?
  17. Quick recommendations for inter-annotator agreement
      • You can measure consistency; probably the best way is Krippendorff’s alpha
      • Don’t use percentage agreement! Particularly when the data are skewed toward one category
      • If 95% of the data fall under one category label, two people coding at random would still agree so often that % agreement would make you think you had a reliable study (even though you wouldn’t); the sketch below makes this concrete
      • And you can ALSO use models to check these things
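
A minimal sketch of the skew problem. It uses Cohen’s kappa from scikit-learn as a stand-in chance-corrected measure (Krippendorff’s alpha isn’t in scikit-learn; for two annotators with no missing data the two behave very similarly). The labels are invented to mimic data skewed 95% toward one category.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels for 100 items: 0 = not-toxic, 1 = toxic (95% skew)
annotator_a = [0] * 95 + [1, 1, 0, 0, 1]
annotator_b = [0] * 95 + [1, 0, 1, 0, 0]

# Raw percentage agreement looks impressive on skewed data
agreement = sum(a == b for a, b in zip(annotator_a, annotator_b)) / 100
print(f"Percentage agreement: {agreement:.0%}")  # 97%

# Chance-corrected agreement tells a very different story
print(f"Cohen's kappa: {cohen_kappa_score(annotator_a, annotator_b):.2f}")  # ~0.39
```
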
  18. Finding healthy communities (supportive)
  19. And unhealthy ones (toxic)
  20. (image-only slide)
  21. Collect data and annotations, then interrogate them. From the human annotations: Which people/categories should we be wary of? Which annotations do we select to train a model with? The output is a classifier that can predict unseen data. (See the sketch below.)
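
One minimal way to do that interrogation, under assumptions of mine rather than anything the deck specifies: take the majority label per item as consensus, then score each annotator against that consensus to see who to be wary of. The item and annotator names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical (item, annotator, label) triples from a crowdsourcing run
annotations = [
    ("msg1", "ann_a", "toxic"), ("msg1", "ann_b", "toxic"), ("msg1", "ann_c", "ok"),
    ("msg2", "ann_a", "ok"),    ("msg2", "ann_b", "ok"),    ("msg2", "ann_c", "ok"),
    ("msg3", "ann_a", "toxic"), ("msg3", "ann_b", "ok"),    ("msg3", "ann_c", "ok"),
]

by_item = defaultdict(list)
for item, annotator, label in annotations:
    by_item[item].append((annotator, label))

# Consensus = most common label per item; these become the training labels
consensus = {item: Counter(label for _, label in votes).most_common(1)[0][0]
             for item, votes in by_item.items()}

# Agreement with consensus flags annotators to be wary of
scores = defaultdict(list)
for item, votes in by_item.items():
    for annotator, label in votes:
        scores[annotator].append(label == consensus[item])

for annotator, hits in sorted(scores.items()):
    print(f"{annotator}: {sum(hits) / len(hits):.0%} agreement with consensus")
```
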
  22. Routing messages that matter
  23. Processing millions of SMS messages in 12 African languages (a language-detection sketch follows below):
      • Intent of sender (e.g., report a problem, ask a question, or make a suggestion)
      • Categorization (e.g., orphans and vulnerable children, violence against children, health, nutrition)
      • Language detection (e.g., English, Acholi, Karamojong, Luganda, Nkole, Swahili, Lango)
      • Location (e.g., village names)
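
As one illustrative slice of that triage, language detection can be framed as text classification over character n-grams, which holds up well on short, noisy SMS. The sketch below is not the deck’s system; the two training examples are placeholders and scikit-learn is an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder training data; a real system would use annotated SMS
train_texts = ["thank you for your help", "webale nnyo"]  # hypothetical examples
train_langs = ["English", "Luganda"]                      # one label per text

lang_id = make_pipeline(
    # Character n-grams are robust for short, misspelled SMS text
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)
lang_id.fit(train_texts, train_langs)
print(lang_id.predict(["some incoming message"]))
```
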
  24. (image-only slide)
  25. 1.4%
  26. (image-only slide)
  27. [Chart: Top 3 categories in Nigeria - Health 39.44%, U-report support 17.68%, Employment 9.69%]
  28. The Donald Rumsfeld Question
  29. How do I find what I don’t know I don’t know?
  30. Negative topics in Walmart employee reviews [chart]: Management 2,404; Low Pay 1,446; Work/life balance 1,241; Hours/Benefits 968; Training & Expectation 968; Dealing With Customers 658; Company Values 518
  31. Structuring unstructured data lets you combine it with other metadata. [Paired bar charts: common pros and common cons among employees, current vs. former, with values ranging from 12% to 41%.]
  32. Question: What improves models the most?
  33. Instead of worrying about the algorithms in the machine
  34. It’s almost always better to just get more pandas
  35. How else do you verify? (A cross-validation sketch follows below.)
      • We assess model accuracy using cross-validation
      • Instead of using all annotated data to train a model, you hold out a random 10% and build the model with the rest
      • Then you predict against that held-out 10%; you do this 10 times and average the accuracy
      • Precision measures: if we automatically label something as X, how often are we right?
      • Recall measures: how much of the data that SHOULD have label X actually gets label X?
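
A minimal sketch of that procedure with scikit-learn (an assumption; the deck doesn’t name a toolkit), using synthetic stand-in data in place of the annotated corpus.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Stand-in data; replace with features/labels from human annotation
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

scores = cross_validate(
    LogisticRegression(max_iter=1000), X, y,
    cv=10,  # hold out a different random 10% each time
    scoring=["accuracy", "precision", "recall"],
)
for metric in ("accuracy", "precision", "recall"):
    vals = scores[f"test_{metric}"]
    print(f"{metric}: {vals.mean():.2f} (+/- {vals.std():.2f})")
```
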
  36. The system gets smarter
      • Here’s what happens over the first 2,543 annotations on one REALLY low-signal classification task
      • By 9,744 annotations, our accuracy is 97%
  37. Other tasks are more straightforward. [Chart: F-scores go up with more annotations; F-score (0.0-1.0) vs. number of paragraphs annotated (50-200) for fields such as Disease, Country, Reported_deaths, Reported_cases, Date, Issue, Location, People affected, # of deaths, Event date.] (A learning-curve sketch follows below.)
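
Curves like these can be produced with scikit-learn’s `learning_curve`, which retrains on growing slices of the annotated data and cross-validates each slice. The synthetic data and plain logistic-regression classifier below are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, random_state=0)  # stand-in data

sizes, _, test_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),  # 10% .. 100% of the training data
    cv=5, scoring="f1",
)
for n, f1 in zip(sizes, test_scores.mean(axis=1)):
    print(f"{n:5d} annotations -> F1 = {f1:.2f}")
```
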
  38. Project workflow
      • Phase 1 - Data: data capture, normalization, and loading
      • Phase 2 - Discovery: topic discovery, category creation, expert data annotation, category verification
      • Phase 3 - Training: guideline creation, annotator validation, model training
      • Phase 4 - Optimization: model evaluation, category refinement
      • Phase 5 - Model Deployment: full system integration, model performance, metrics reporting
  39. THANK YOU! email: tyler@idibon.com | twitter: @idibon | web: idibon.com
