FCPCCS - Big Data and Crowdsourcing
Pattern-recognition and the crowd
What would you do with unlimited human analysts?
People
Data
Categories
Models
Unstructured data gets structured (bonus: a system that gets smarter over time)
Adaptive System
Machine Learning
Optimization
Human Annotation
Prediction Engine
Structured Data
Reports
Action
Efficiency of human time is a major benefit
[Bar chart. Series: % analyst time saved; % accuracy (compared to humans). Labels: News Category 4, News Category 2, News Category 1, Manufacturing, Health Sciences, Finding Relevant News Articles. Axis: 0% to 100%. Values as extracted (pairing with labels and series not recoverable): 80%, 85%, 99%, 83%, 81%, 88%, 87%, 90%, 73%, 91%.]
The importance of definition
• If people can’t agree on what’s-in and what’s-out, it’s hard to train a machine
Wait a sec! Aren’t these ducks?
(Can we agree to disagree?)
The importance of definition
• If people can’t agree on what’s-in and what’s-out, it’s hard to train a machine
• In our case toxicity was defined as:
  • ad hominem attacks (directed at specific people)
  • bigoted comments (e.g., sexist, racist, homophobic)
• Set definitions
• Then see if people are consistent
  • Run pilots
  • Measure inter-annotator agreement
• Iterate
Inter-annotator agreement: is everyone
measuring the same way?
Quick recommendation for inter-annotator agreement
• You can measure consistency; probably the best way is Krippendorff’s alpha
• Don’t use percentage agreement! Particularly when data are skewed towards one category.
  • If 95% of the data fall under one category label, then even random coding would have two people agree so often that % agreement would make you think you had a reliable study (even though you wouldn’t)
• And you can ALSO use models to check these things
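The gap between percentage agreement and Krippendorff's alpha on skewed data can be sketched in a few lines. This is a minimal illustration, not the tooling used for the talk: the function names, the two coders, and the 100 made-up labels are all assumptions, and the alpha here handles nominal labels only.

```python
from collections import Counter
from itertools import permutations


def percent_agreement(coder_a, coder_b):
    """Share of items on which two coders chose the same label."""
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return matches / len(coder_a)


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` holds one list of labels per item (one label per coder who
    rated that item). Items with fewer than two labels are skipped.
    """
    # Coincidence matrix: every ordered pair of labels given to the same
    # item by different coders, weighted by 1 / (labels on the item - 1).
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        for c, k in permutations(labels, 2):
            coincidences[(c, k)] += 1 / (m - 1)

    totals = Counter()
    for (c, _k), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())

    # Observed vs. chance-expected disagreement (both scaled by n).
    observed = sum(w for (c, k), w in coincidences.items() if c != k)
    expected = sum(
        totals[c] * totals[k] for c in totals for k in totals if c != k
    ) / (n - 1)
    if expected == 0:
        return float("nan")  # only one label was ever used; alpha is undefined
    return 1 - observed / expected


# Hypothetical skewed data: each coder marks 5 of 100 comments "toxic",
# but never the same 5.
coder_a = ["toxic"] * 5 + ["ok"] * 95
coder_b = ["ok"] * 5 + ["toxic"] * 5 + ["ok"] * 90

agreement = percent_agreement(coder_a, coder_b)
alpha = krippendorff_alpha_nominal([list(pair) for pair in zip(coder_a, coder_b)])
print(f"percent agreement: {agreement:.2f}")
print(f"Krippendorff's alpha: {alpha:.3f}")
```

On this made-up data the two coders agree on 90% of items yet never agree on a single toxic comment, and alpha comes out slightly below zero, which is what chance-level coding looks like.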
Finding healthy communities (supportive)
And unhealthy ones (toxic)
Collect data and annotations, then interrogate them
Human annotations
• Which people/categories should we be wary of?
• Which annotations do we select to train a model with?
A classifier that can predict unseen data
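One simple way to start asking "which people should we be wary of?" is to compare each annotator with the majority label per item. The sketch below is only an illustration under assumptions: the tuple layout, the annotator names, and the labels are hypothetical, and the slides do not say how this check was actually done.

```python
from collections import Counter, defaultdict


def majority_labels(annotations):
    """Map each item to its most common label.

    `annotations` is a list of (item_id, annotator, label) tuples.
    Ties go to whichever label was seen first.
    """
    by_item = defaultdict(list)
    for item_id, _annotator, label in annotations:
        by_item[item_id].append(label)
    return {
        item: Counter(labels).most_common(1)[0][0]
        for item, labels in by_item.items()
    }


def agreement_with_majority(annotations):
    """Share of each annotator's labels that match the majority label."""
    majority = majority_labels(annotations)
    hits, totals = Counter(), Counter()
    for item_id, annotator, label in annotations:
        totals[annotator] += 1
        hits[annotator] += label == majority[item_id]
    return {name: hits[name] / totals[name] for name in totals}


# Hypothetical annotations from three annotators on four items.
annotations = [
    (1, "ann", "toxic"), (1, "bo", "toxic"), (1, "cy", "ok"),
    (2, "ann", "ok"), (2, "bo", "ok"), (2, "cy", "toxic"),
    (3, "ann", "ok"), (3, "bo", "ok"), (3, "cy", "ok"),
    (4, "ann", "toxic"), (4, "bo", "toxic"), (4, "cy", "ok"),
]
rates = agreement_with_majority(annotations)
print(rates)
```

A low rate is a prompt to look closer, not a verdict: it can point to an unclear category definition as easily as to an unreliable annotator. Note that each annotator's own vote counts toward the majority here, which flatters everyone slightly.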
Routing messages that matter
Processing millions of SMS in 12 African languages
• Intent of sender (e.g., report a problem, ask a question, or make a suggestion)
• Categorization (e.g., orphans and vulnerable children, violence against children, health, nutrition)
• Language detection (e.g., English, Acholi, Karamojong, Luganda, Nkole, Swahili, Lango)
• Location (e.g., village names)
Top 3 categories in Nigeria
[Chart. Labels as extracted: Employment, U-report support, Health. Values as extracted: 9.69%, 17.68%, 39.44%.]
The Donald Rumsfeld Question
How do I find what I don’t know I don’t know?
Negative topics in Walmart employee reviews
[Chart. Labels and counts in extracted order: Hours/Benefits, 968, 518; Management, 2,404; Work/life balance, 1,241; Company Values; Dealing With Customers, 658; Training & Expectation, 968; Low Pay, 1,446. The label that 518 belongs to is not recoverable with certainty.]
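Structuring unstructured data lets you combine it with other metadata
[Two bar charts, Common Pros among Employees and Common Cons among Employees, each comparing Current and Former employees. First chart (axis 0% to 50%), values as extracted: 37%, 25%, 24%, 41%, 27%, 17%. Second chart (axis 0% to 30%), values as extracted: 24%, 16%, 13%, 13%, 14%, 16%, 12%. Category labels and the series each value belongs to are not recoverable.]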
Question: What improves models the most?
Instead of worrying about the algorithms in the machine
It’s almost always better to just get more pandas
How else do you verify?
• We assess model accuracy using cross-validation.
  • Instead of using all annotated data to train a model, you hold out a random 10% and build the model with the rest.
  • Then you predict against that 10%. You do this 10 times and average the accuracy.
• Precision measures “if we automatically label something as X, how often are we right?”
• Recall measures “how much of the stuff that SHOULD have label X is actually given label X?”
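The hold-out procedure and the two metrics above can be sketched with the standard library alone. Everything here is illustrative: the function names are made up, the "model" is a deliberately trivial majority-class baseline, and the sketch reads "hold out 10%, ten times" as standard 10-fold cross-validation, where each item is held out exactly once. It assumes there are at least as many items as folds.

```python
import random
from collections import Counter


def k_fold_indices(n_items, k=10, seed=0):
    """Shuffle item indices and deal them into k roughly equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def precision_recall(gold, predicted, label):
    """Precision and recall for one label."""
    true_pos = sum(g == label and p == label for g, p in zip(gold, predicted))
    predicted_pos = sum(p == label for p in predicted)
    actual_pos = sum(g == label for g in gold)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall


def cross_validate(items, labels, train, predict, k=10, seed=0):
    """Average held-out accuracy over k folds.

    `train(items, labels)` returns a model; `predict(model, item)`
    returns a label.
    """
    accuracies = []
    for fold in k_fold_indices(len(items), k, seed):
        held_out = set(fold)
        train_idx = [i for i in range(len(items)) if i not in held_out]
        model = train([items[i] for i in train_idx], [labels[i] for i in train_idx])
        correct = sum(predict(model, items[i]) == labels[i] for i in fold)
        accuracies.append(correct / len(fold))
    return sum(accuracies) / len(accuracies)


# Hypothetical baseline "model": always predict the most common training label.
def train_majority(items, labels):
    return Counter(labels).most_common(1)[0][0]


def predict_majority(model, item):
    return model


items = [f"message {i}" for i in range(100)]
labels = ["ok"] * 90 + ["toxic"] * 10
mean_accuracy = cross_validate(items, labels, train_majority, predict_majority)
precision, recall = precision_recall(labels, ["ok"] * 100, "toxic")
print(mean_accuracy, precision, recall)
```

The baseline shows why accuracy alone is not enough: it scores 90% accuracy on this skewed toy data while its precision and recall for "toxic" are both zero.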
The system gets smarter
• Here’s what happens going across the first 2,543 annotations on one REALLY low-signal classification task
• By 9,744 annotations, our accuracy is 97%
Other tasks are more straightforward
[Chart: F-scores go up with more annotations. Y-axis: F-score (0.00 to 1.00). X-axis: number of paragraphs annotated (50 to 200). Legend labels as extracted: Disease, Country, Reported_deaths, Reported_cases, Date. Additional labels as extracted: Issue, Location, People affected, # of deaths, Event date.]
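The slide does not say which F-score variant is plotted. The most common one is F1, the harmonic mean of precision and recall, so the sketch below assumes that; the function name and the `beta` parameter are illustrative.

```python
def f_score(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta is 1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# A field extractor that is always right when it fires (precision 1.0)
# but finds only half of the true mentions (recall 0.5).
print(round(f_score(1.0, 0.5), 3))
```

Because it is a harmonic mean, the score is pulled toward the weaker of the two numbers, so a model cannot score well by maximizing only precision or only recall.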
Project workflow
Phase 1: Data
• Data capture, normalization and loading
Phase 2: Discovery
• Topic discovery
• Category creation
• Expert data annotation
• Category verification
Phase 3: Training
• Guideline creation
• Annotator validation
• Model training
Phase 4: Optimization
• Model evaluation
• Category refinement
Phase 5: Model Deployment
• Full system integration
• Model performance
• Metrics reporting
email tyler@idibon.com
twitter @idibon
www idibon.com
THANK YOU!


Crowdsourcing big data_industry_jun-25-2015_for_slideshare

  • 1. FCPCCS - Big Data and Crowdsourcing Pattern-recognition and the crowd
  • 2. FCPCCS - Big Data and Crowdsourcing What would you do with unlimited human analysts?
  • 3. FCPCCS - Big Data and Crowdsourcing
  • 4. FCPCCS - Big Data and Crowdsourcing People DataCategories
  • 5. FCPCCS - Big Data and Crowdsourcing Models
  • 6. FCPCCS - Big Data and Crowdsourcing
  • 7. FCPCCS - Big Data and Crowdsourcing
  • 8. FCPCCS - Big Data and Crowdsourcing Unstructured data gets structured (bonus: a system that gets smarter over time) Adaptive System Machine Learning Optimization Human Annotation Prediction Engine Structured Data Reports Action
  • 9. FCPCCS - Big Data and Crowdsourcing 80% 85% 99% 83% 81% 88% 87% 90% 73% 91% 0% 50% 100% News Category 4 News Category 2 News Category 1 Manufacturing Health Sciences Finding Relevant News Articles % analyst time saved % accuracy (compared to humans) Efficiency of human time is a major benefit
  • 10. FCPCCS - Big Data and Crowdsourcing
  • 11. FCPCCS - Big Data and Crowdsourcing
  • 12. FCPCCS - Big Data and Crowdsourcing The importance of definition • If people can’t agree on what’s-in and what’s-out, it’s hard to train a machine
  • 13. FCPCCS - Big Data and Crowdsourcing
  • 14. FCPCCS - Big Data and Crowdsourcing Wait a sec! Aren’t these ducks? (Can we agree to disagree?)
  • 15. FCPCCS - Big Data and Crowdsourcing The importance of definition • If people can’t agree on what’s-in and what’s-out, it’s hard to train a machine • In our case toxicity was defined as: • ad hominem attacks (directed at specific people) • bigoted comments (e.g., sexist, racist, homophobic, etc) • Set definitions • Then see if people are consistent • Run pilots • Do inter-annotator agreement • Iterate
  • 16. FCPCCS - Big Data and Crowdsourcing Inter-annotator agreement: is everyone measuring the same way?
  • 17. FCPCCS - Big Data and Crowdsourcing Quick recommendation for inter-annotator agreement • You can measure consistency, probably the best way is Krippendorff’s alpha • Don’t use percentage agreement! Particularly when data are skewed towards one category. • If 95% of the data fall under one category label, then random coding would still have two people agree so much that % agreement would make you think you had a reliable study (even though you wouldn’t) • And you can ALSO use models to check these things
  • 18. FCPCCS - Big Data and Crowdsourcing Finding healthy communities (supportive)
  • 19. FCPCCS - Big Data and Crowdsourcing And unhealthy ones (toxic)
  • 20. FCPCCS - Big Data and Crowdsourcing
  • 21. FCPCCS - Big Data and Crowdsourcing Collect data and annotations—then interrogate it Human annotations Which people/categories should we be wary of? Which annotations do we select to train a model with? A classifier that can predict unseen data
  • 22. FCPCCS - Big Data and Crowdsourcing Routing messages that matter
  • 23. FCPCCS - Big Data and Crowdsourcing Processing millions of SMS in 12 African languages Intent of sender (i.e. report a problem, ask a question or make a suggestion) Categorization (i.e. orphans and vulnerable children, violence against children, health, nutrition) Language detection (i.e. English, Acholi, Karamojong, Luganda, Nkole, Swahili, Lango) Location (i.e. village names)
  • 24. FCPCCS - Big Data and Crowdsourcing
  • 25. FCPCCS - Big Data and Crowdsourcing 1.4%
  • 26. FCPCCS - Big Data and Crowdsourcing
  • 27. FCPCCS - Big Data and Crowdsourcing Top 3 categories in Nigeria 9.69% 17.68% 39.44% Employment U-report support Health
  • 28. FCPCCS - Big Data and Crowdsourcing The Donald Rumsfeld Question
  • 29. FCPCCS - Big Data and Crowdsourcing How do I find what I don’t know I don’t know?
  • 30. FCPCCS - Big Data and Crowdsourcing Negative topics in Walmart employee reviews Hours/Benefits 968 518 Management 2,404 Work/life balance 1,241 Company Values Dealing With Customers 658 Training & Expectation 968 Low Pay 1,446
  • 31. FCPCCS - Big Data and Crowdsourcing Common Pros among Employees Common Cons among Employees 37% 25% 24% 41% 27% 17% 0% 10% 20% 30% 40% 50% Current Former 24% 16% 13% 13%14% 16% 12% 0% 10% 20% 30% Current Former Structuring unstructured data lets you combine it with other metadata
  • 32. FCPCCS - Big Data and Crowdsourcing Question: What improves models the most?
  • 33. FCPCCS - Big Data and Crowdsourcing Instead of worrying about the algorithms in the machine
  • 34. FCPCCS - Big Data and Crowdsourcing It’s almost always better to just get more pandas
  • 35. FCPCCS - Big Data and Crowdsourcing How else do you verify?  We assess model accuracy using cross-validation.  Instead of using all annotated data to train a model, you hold out a random 10% and build the model with the rest.  Then you predict against that 10%. You do this 10 times and average the accuracy.  Precision measures “if we automatically label something as X, how often are we right?”  Recall measures “how much of stuff that SHOULD have label X are actually given label X?”
  • 36. FCPCCS - Big Data and Crowdsourcing The system gets smarter • Here’s what happens going across the first 2,543 annotations on one REALLY low-signal classification task • By 9,744 annotations, our accuracy is 97%
  • 37. FCPCCS - Big Data and Crowdsourcing Other tasks are more straightforward [Chart: F-scores go up with more annotations; y-axis: F-score (0.00 to 1.00); x-axis: Number of paragraphs annotated (50 to 200); series labels: Disease, Country, Reported_deaths, Reported_cases, Date, Issue, Location, People affected, # of deaths, Event date]
  • 38. FCPCCS - Big Data and Crowdsourcing Project workflow Phase 1: Data • Data capture, normalization and loading Phase 2: Discovery • Topic discovery • Category creation • Expert data annotation • Category verification Phase 3: Training • Guideline creation • Annotator validation • Model training Phase 4: Optimization • Model evaluation • Category refinement Phase 5: Model Deployment • Full system integration • Model performance • Metrics reporting
  • 39. FCPCCS - Big Data and Crowdsourcing email tyler@idibon.com twitter @idibon www idibon.com THANK YOU!
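
The verification procedure on slide 35 (hold out 10%, train on the rest, predict, repeat 10 times, average) can be sketched in a few lines of Python. This is a minimal illustration, not Idibon's implementation: it assumes the ten held-out sets are disjoint folds (standard 10-fold cross-validation), and the helper names, the toy threshold "model", and the numeric data are all made up for the example.

```python
import random


def precision_recall(y_true, y_pred, label):
    """Precision and recall for one label, as defined on slide 35."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == label and t == label)  # labeled X, is X
    fp = sum(1 for t, p in pairs if p == label and t != label)  # labeled X, is not X
    fn = sum(1 for t, p in pairs if p != label and t == label)  # is X, not labeled X
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def k_fold_indices(n, k=10, seed=0):
    """Shuffle item indices and deal them into k disjoint held-out sets."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(xs, ys, train_fn, k=10, seed=0):
    """Train on everything except one fold, score on that fold, average over folds."""
    accuracies = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        predict = train_fn(train_x, train_y)
        correct = sum(1 for i in held_out if predict(xs[i]) == ys[i])
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / len(accuracies)


def train_threshold(train_x, train_y):
    """Hypothetical stand-in model: cut halfway between the two class means."""
    pos = [x for x, y in zip(train_x, train_y) if y == "X"]
    neg = [x for x, y in zip(train_x, train_y) if y != "X"]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: "X" if x >= cut else "other"


# Made-up, cleanly separable data: 50 "other" items and 50 "X" items.
xs = list(range(50)) + list(range(100, 150))
ys = ["other"] * 50 + ["X"] * 50
mean_accuracy = cross_validate(xs, ys, train_threshold)
print(f"mean accuracy over 10 folds: {mean_accuracy:.2f}")
```

Any real classifier can replace `train_threshold`, as long as it takes training items and labels and returns a prediction function. Calling `precision_recall` on the held-out predictions for a single label gives the two per-label measures quoted on the slide.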

Editor's Notes

  1. http://nypost.com/2015/02/07/meet-the-bird-brains-batty-enough-to-go-bird-watching-in-winter/
  2. This is the basic stuff you want. (It’s a little self-serving because Idibon’s adaptive system is what makes us special, but we really do believe that optimizing training on relevant data with meaningful categories is THE way to deliver business value.) By using computers to create an initial understanding of data and elevate specific cases for Human Annotation, we use computers to make human decisions smarter, and humans to make computer decisions smarter. Our system optimizes work by using cutting-edge Machine Learning that improves accuracy and learns iteratively. Our Prediction Engine provides initial conclusions for further evaluation by human analysts and is also what allows us to scale to tens of millions of messages a day. Our Optimization process teaches our algorithm what results to select for, essentially refining its accuracy. The key takeaway here is that we optimize for human analysts’ time; we can cluster data initially and automatically, then we can escalate specific cases to human annotation. Much of the learning is unsupervised and therefore faster, cheaper and actually more accurate. After iterations in our adaptive system, previously unstructured data is now structured. This structured data can be delivered in different outputs, including CSV file exports for your analysts to build reports or direct routing to customer service agents to take action.
  3. As you can see, different categories have different results. News category 1 is awesome: you really don’t have to show human analysts much data to get all the Relevant stuff (you show them 10% of the data and still get 99% of what the client cares about). Manufacturing is less awesome. You can reduce your workload by 73%…but you have to accept that you’ll only get 83% of the stuff you care about (you’ll miss 17%). If you want to get more like 90% accuracy, you need to review more documents. You “only” get a workload reduction of ~56%. Ideally, you want a system that gets better over time.
  4. First case study! http://idibon.com/toxicity-in-reddit-communities-a-journey-to-the-darkest-depths-of-the-interwebs/
  5. Lately, Reddit has gotten a lot of press for having terrible, awful communities
  6. See also http://cswww.essex.ac.uk/Research/nle/arrau/icagr.pdf
  7. http://blog.ioactive.com/2013/05/security-101-machine-learning-and-big.html
  8. The important thing is having definitions people will agree with and can be consistent with…and which actually answer organizational objectives. Do you care about whether duck decoys and/or rubber duckies are ducks or not? WHY? http://blog.ioactive.com/2013/05/security-101-machine-learning-and-big.html
  9. The trickiest thing about ad hominem attacks as a definition is: what to do with trash talk in sports/gaming. Tricky!
  10. The trickiest thing about ad hominem attacks as a definition is: what to do with trash talk in sports/gaming. Tricky!
  11. This is interactive, check out: http://idibon.com/toxicity-in-reddit-communities-a-journey-to-the-darkest-depths-of-the-interwebs/ The DIY (do it yourself) group is the one that is most supportive and least toxic. This data ties to actual upvote/downvote behavior. Meaning that you’re not actually a supportive community if everyone down votes the supportive comments, nor are you a toxic community if everyone downvotes the toxic comments.
  12. This is interactive, check out: http://idibon.com/toxicity-in-reddit-communities-a-journey-to-the-darkest-depths-of-the-interwebs/ It’s only when everyone upvotes toxic comments that you are a toxic community by our definition here.
  13. We also specifically looked at bigotry. Indeed, /r/TheRedPill, is seen as the most bigoted. It’s a subreddit dedicated to proud male chauvinism. This is interactive, check out: http://idibon.com/toxicity-in-reddit-communities-a-journey-to-the-darkest-depths-of-the-interwebs/
  14. Case study three: http://idibon.com/idibon-supports-unicef-provide-natural-language-processing-sms-based-social-monitoring-systems-africa/ Photo: http://unicefaids.tumblr.com/post/37835112363/photo-young-people-in-kitwe-zambia-explore-the
  15. The United Nations Children’s Fund (UNICEF) is a United Nations branch that provides long-term humanitarian and developmental assistance to children and mothers in developing countries. Idibon provides scalable natural language processing and analytics to UNICEF’s multinational U-report applications, enabling UNICEF to process text messages sent from citizens in Uganda and Nigeria “to better understand and empower marginalized communities that are often excluded due to language barriers.” (Evan Wheeler, CTO of UNICEF’s Global Innovation Centre). UNICEF U-report only has six dedicated analysts to process and respond to millions of messages a month, and Idibon’s technology enables the organization to operate efficiently and at scale. Specifically, Idibon processes each SMS in four ways: (1) Intent of sender – to prioritize support/services (UNICEF receives more than a million messages a month and can only respond to about a thousand); (2) Categorization – to prioritize support/services and to route to appropriate analyst; (3) Language detection – to route to appropriate analyst; (4) Location – to identify where to send support/services. Press release: http://unicefstories.org/2015/02/09/idibon-supports-unicef-to-provide-natural-language-processing-to-sms-based-social-monitoring-systems-in-africa/
  16. Environment is an important issue. But it looks to be about 1.4% of the data…which means you do have to get enough data to build a model. Note that different countries/languages talk about the environment differently (Uganda=droughts, cows; Nigeria: oil). So you may have more or less heterogeneity in your rarer categories. Image from http://www.theatlantic.com/photo/2011/06/nigeria-the-cost-of-oil/100082/ For more recent news: http://www.theguardian.com/environment/2015/jan/07/niger-delta-communities-to-sue-shell-in-london-for-oil-spill-compensation
  17. “Environment” is clearly an important issue in Nigeria but only 1.4% of the messages are classified that way. (One other thing: high/low percentages don’t necessarily correspond to personal or societal importance.)
  18. Each needle found makes the next one easier to find, buuuuuuut some things you want to find are just too rare. You can’t model things that aren’t in the data.
  19. At UNICEF, different people care about different categories: the people who respond to rumors of Ebola outbreaks or cures are different than the people trying to keep track of economic issues. Most actionable is, of course, finding people who specifically require support about participating in the community.
  20. Pay and Opportunities are much less of a pro once employees have left Walmart and become more of a con. Management is highly criticised amongst both current and former employees.
  21. 9,744 annotations total: 951 for engageable, 8,793 for irrelevant.
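
The trade-off described in note 3 (show analysts less data, accept finding a smaller share of what matters) can be made concrete with a small sketch. It assumes, purely for illustration, that documents are reviewed in descending order of a model relevance score; the function name `review_tradeoff`, the scores, and the labels are invented, and nothing here is confirmed as how Idibon's system works.

```python
def review_tradeoff(scores, is_relevant, review_fraction):
    """Review only the top-scoring fraction of documents.

    Returns (time_saved, share_found): the fraction of documents analysts
    skip, and the fraction of all relevant documents they still see.
    """
    ranked = sorted(zip(scores, is_relevant), key=lambda pair: pair[0], reverse=True)
    n_review = round(len(ranked) * review_fraction)
    found = sum(1 for _, relevant in ranked[:n_review] if relevant)
    total_relevant = sum(1 for relevant in is_relevant if relevant)
    time_saved = 1 - n_review / len(ranked)
    share_found = found / total_relevant if total_relevant else 0.0
    return time_saved, share_found


# Made-up scores and ground-truth relevance for ten documents.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
is_relevant = [True, True, False, True, False, False, False, True, False, False]

for fraction in (0.5, 0.8):
    saved, share = review_tradeoff(scores, is_relevant, fraction)
    print(f"review {fraction:.0%}: time saved {saved:.0%}, relevant found {share:.0%}")
```

Reviewing a larger fraction raises the share of relevant documents found and lowers the time saved, which is the same shape of trade-off as the Manufacturing example in note 3.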