SlideShare a Scribd company logo
1 of 18
A Robust Framework for Classifying Evolving Document
Streams in an Expert-Machine-Crowd Setting
Muhammad Imran*, Sanjay Chawla*, Carlos Castillo**
*Qatar Computing Research Institute, Doha, Qatar
**Eurecat, Barcelona, Spain
Data Stream Processing
Challenges
1. Infinite length
2. Concept-drift (change in data distributions)
3. Concept-evolution (new categories emerge)
4. Limited labeled data
Credit Card fraud detection Sensor data classification Social media stream mining
Data stream
Social Media Stream Processing in
Time-Critical Situations
2013 Pakistan Earthquake
September 28 at 07:34 UTC
2010 Haiti Earthquake
January 12 at 21:53 UTC
Social Media
Platforms
Availability of Immense Data:
Around 16 thousands tweets
per minute were posted during
the hurricane Sandy in the US.
Opportunities:
- Early warning and event detection
- Situational awareness
- Actionable information extraction
- Rapid crisis response
- Post-disaster analysis
Disease outbreaks
Social Media Data Streams
Classification
We address two issues in the classification (supervised) of
social media streams:
1. How to keep the categories used for classification up-to-date?
1. While adding new categories, how to maintain high
classification accuracy?
Input and Output
Category A Category B Category C Miscellaneous Z
Category A’ Category B’ Category C’
Z1 Z2
Z’
INPUTOUTPUT
Problem Definition
Given as input a data set of documents:
Categorized into a taxonomy: containing
Partitioning of documents into taxonomy:
Our task is to produce a new taxonomy:
With the following characteristics:
• There are N new categories:
• Pre-existing categories are slightly modified:
• New categories are different than the old:
• The size of the miscellaneous category is reduced:
Expert-Machine-Crowd Setting
Constraints Outlier Detection (COD-Means):
1. Constraints formation using classified items
2. Clustering using COD-Means
3. Labeling errors identification (using outlier detection)
Expert-Machine-Crowd Setting
Constraints Outlier Detection (COD-Means):
1. Constraints formation using classified items
2. Clustering using COD-Means
3. Labeling errors identification (using outlier detection)
1
2
3
4
Constraints Formation
1. Items in same category have Must-link constraints
2. Items belonging to different categories have Cannot-link
constraints
Category A Category B Category C Category Z
Must-link
Cannot-linkNote: Items in Z do not have any constraints
Objective Function
Standard distortion error
If an ML constraint if violated
then the cost of the violation is
equal to the distance between
the two centroids that contain
the instances.
If a CL constraint is violated then
the error cost is the distance
between the centroid C assigned
to the pair and its nearest
centroid h(c).
Assignment and Update Rules
Rule 1: For items without any constraints (standard distortion error)
Rule 2: For items with Must-link constraints; cost of violation is distance b/w their centroids
Rule 3: For items with Cannot-link constraints; cost is the distance b/w centroid c and
Its nearest centroid
is the Kronecker delta function
i.e. it is 1 if x=y and 0 if x != y
Update rule: The update rule computes a modified
average of all points that belong to a
cluster.
COD-Means Algorithm
Algorithm
1
2
3
Initialization (e.g. random pick of k centroids)
Assignment of items based on 3 assignment
rules considering ML and CL constraints
Points in each cluster are sorted based
on their distance to the centroid and
top l are removed and inserted into L
Dataset and Experiments
1. Are the new clusters identified by the COD-Means algorithm genuinely different and
novel?
2. What is the nature of outliers (labeling errors) discovered by the COD-Means
algorithm? Are they genuine outliers?
3. What is the impact of outlier on the quality of clusters generated by COD-Means?
4. Once refined clusters (without labeling errors) used in the training process, does the
overall accuracy improves?
8 disaster-related datasets were used from Twitter
Clusters Novelty and Coherence
K-Means vs. COD-Means
• The proposed approach generates more cohesive and novel clusters by removing outliers.
• As the value of L increases, more tight and coherent clusters are observed.
Data Improvements Evaluation
1. Labeling errors in non-miscellaneous categories
2. Items incorrectly labeled as miscellaneous
Impact on Classification Performance
Conclusion
• Our setting: supervised stream classification
• We presented COD-Means to learn novel
categories and labeling errors from live streams
• We used real-word Twitter datasets and
performed extensive experimentation
• We showed that COD-Means is able to identify
new categories and labeling errors efficiently
Thank you for your attention!

More Related Content

Viewers also liked

D-sieve : A Novel Data Processing Engine for Crises Related Social Messages
D-sieve : A Novel Data Processing Engine for Crises Related Social MessagesD-sieve : A Novel Data Processing Engine for Crises Related Social Messages
D-sieve : A Novel Data Processing Engine for Crises Related Social Messageswire unitn
 
Maps from the Crowd in Crisis context / OpenStreetMap Response to humanitaria...
Maps from the Crowd in Crisis context / OpenStreetMap Response to humanitaria...Maps from the Crowd in Crisis context / OpenStreetMap Response to humanitaria...
Maps from the Crowd in Crisis context / OpenStreetMap Response to humanitaria...Pierre Béland
 
The Role of Social Media and Artificial Intelligence for Disaster Response
The Role of Social Media and Artificial Intelligence for Disaster ResponseThe Role of Social Media and Artificial Intelligence for Disaster Response
The Role of Social Media and Artificial Intelligence for Disaster ResponseMuhammad Imran
 
SIAM SDM2014 tutorial - Social Media and Web of Data to Assist Crisis Respons...
SIAM SDM2014 tutorial - Social Media and Web of Data to Assist Crisis Respons...SIAM SDM2014 tutorial - Social Media and Web of Data to Assist Crisis Respons...
SIAM SDM2014 tutorial - Social Media and Web of Data to Assist Crisis Respons...Artificial Intelligence Institute at UofSC
 

Viewers also liked (6)

D-sieve : A Novel Data Processing Engine for Crises Related Social Messages
D-sieve : A Novel Data Processing Engine for Crises Related Social MessagesD-sieve : A Novel Data Processing Engine for Crises Related Social Messages
D-sieve : A Novel Data Processing Engine for Crises Related Social Messages
 
Maps from the Crowd in Crisis context / OpenStreetMap Response to humanitaria...
Maps from the Crowd in Crisis context / OpenStreetMap Response to humanitaria...Maps from the Crowd in Crisis context / OpenStreetMap Response to humanitaria...
Maps from the Crowd in Crisis context / OpenStreetMap Response to humanitaria...
 
The Role of Social Media and Artificial Intelligence for Disaster Response
The Role of Social Media and Artificial Intelligence for Disaster ResponseThe Role of Social Media and Artificial Intelligence for Disaster Response
The Role of Social Media and Artificial Intelligence for Disaster Response
 
Crisis Computing
Crisis ComputingCrisis Computing
Crisis Computing
 
SIAM SDM2014 tutorial - Social Media and Web of Data to Assist Crisis Respons...
SIAM SDM2014 tutorial - Social Media and Web of Data to Assist Crisis Respons...SIAM SDM2014 tutorial - Social Media and Web of Data to Assist Crisis Respons...
SIAM SDM2014 tutorial - Social Media and Web of Data to Assist Crisis Respons...
 
Social Data Mining
Social Data MiningSocial Data Mining
Social Data Mining
 

Similar to A Robust Framework for Classifying Evolving Document Streams in an Expert-Machine-Crowd Setting

In data streams using classification and clustering
In data streams using classification and clusteringIn data streams using classification and clustering
In data streams using classification and clusteringeSAT Publishing House
 
In data streams using classification and clustering different techniques to f...
In data streams using classification and clustering different techniques to f...In data streams using classification and clustering different techniques to f...
In data streams using classification and clustering different techniques to f...eSAT Journals
 
Fault Detection in Mobile Communication Networks Using Data Mining Techniques...
Fault Detection in Mobile Communication Networks Using Data Mining Techniques...Fault Detection in Mobile Communication Networks Using Data Mining Techniques...
Fault Detection in Mobile Communication Networks Using Data Mining Techniques...ijcisjournal
 
pratik meshram-Unit 5 (contemporary mkt r sch)
pratik meshram-Unit 5 (contemporary mkt r sch)pratik meshram-Unit 5 (contemporary mkt r sch)
pratik meshram-Unit 5 (contemporary mkt r sch)Pratik Meshram
 
Fuzzy Rule Base System for Software Classification
Fuzzy Rule Base System for Software ClassificationFuzzy Rule Base System for Software Classification
Fuzzy Rule Base System for Software Classificationijcsit
 
Crowd Density Estimation Using Base Line Filtering
Crowd Density Estimation Using Base Line FilteringCrowd Density Estimation Using Base Line Filtering
Crowd Density Estimation Using Base Line Filteringpaperpublications3
 
A Critique On Code Critics
A Critique On Code CriticsA Critique On Code Critics
A Critique On Code CriticsLaurie Smith
 
Anomaly detection in plain static graphs
Anomaly detection in plain static graphsAnomaly detection in plain static graphs
Anomaly detection in plain static graphsdash-javad
 
Caim discretization algorithm
Caim discretization algorithmCaim discretization algorithm
Caim discretization algorithmenok7
 
V1_I2_2012_Paper6.doc
V1_I2_2012_Paper6.docV1_I2_2012_Paper6.doc
V1_I2_2012_Paper6.docpraveena06
 
Survey of Data Mining Techniques on Crime Data Analysis
Survey of Data Mining Techniques on Crime Data AnalysisSurvey of Data Mining Techniques on Crime Data Analysis
Survey of Data Mining Techniques on Crime Data Analysisijdmtaiir
 
Survey of Data Mining Techniques on Crime Data Analysis
Survey of Data Mining Techniques on Crime Data AnalysisSurvey of Data Mining Techniques on Crime Data Analysis
Survey of Data Mining Techniques on Crime Data Analysisijdmtaiir
 
Analysis on different Data mining Techniques and algorithms used in IOT
Analysis on different Data mining Techniques and algorithms used in IOTAnalysis on different Data mining Techniques and algorithms used in IOT
Analysis on different Data mining Techniques and algorithms used in IOTIJERA Editor
 
A Hybrid Theory Of Power Theft Detection
A Hybrid Theory Of Power Theft DetectionA Hybrid Theory Of Power Theft Detection
A Hybrid Theory Of Power Theft DetectionCamella Taylor
 
Threshold for Size and Complexity Metrics: A Case Study from the Perspective ...
Threshold for Size and Complexity Metrics: A Case Study from the Perspective ...Threshold for Size and Complexity Metrics: A Case Study from the Perspective ...
Threshold for Size and Complexity Metrics: A Case Study from the Perspective ...SAIL_QU
 
A Review on Covid Detection using Cross Dataset Analysis
A Review on Covid Detection using Cross Dataset AnalysisA Review on Covid Detection using Cross Dataset Analysis
A Review on Covid Detection using Cross Dataset AnalysisIRJET Journal
 

Similar to A Robust Framework for Classifying Evolving Document Streams in an Expert-Machine-Crowd Setting (20)

In data streams using classification and clustering
In data streams using classification and clusteringIn data streams using classification and clustering
In data streams using classification and clustering
 
In data streams using classification and clustering different techniques to f...
In data streams using classification and clustering different techniques to f...In data streams using classification and clustering different techniques to f...
In data streams using classification and clustering different techniques to f...
 
Data clustering
Data clustering Data clustering
Data clustering
 
Fault Detection in Mobile Communication Networks Using Data Mining Techniques...
Fault Detection in Mobile Communication Networks Using Data Mining Techniques...Fault Detection in Mobile Communication Networks Using Data Mining Techniques...
Fault Detection in Mobile Communication Networks Using Data Mining Techniques...
 
pratik meshram-Unit 5 (contemporary mkt r sch)
pratik meshram-Unit 5 (contemporary mkt r sch)pratik meshram-Unit 5 (contemporary mkt r sch)
pratik meshram-Unit 5 (contemporary mkt r sch)
 
Clustering
ClusteringClustering
Clustering
 
Fuzzy Rule Base System for Software Classification
Fuzzy Rule Base System for Software ClassificationFuzzy Rule Base System for Software Classification
Fuzzy Rule Base System for Software Classification
 
Crowd Density Estimation Using Base Line Filtering
Crowd Density Estimation Using Base Line FilteringCrowd Density Estimation Using Base Line Filtering
Crowd Density Estimation Using Base Line Filtering
 
A Critique On Code Critics
A Critique On Code CriticsA Critique On Code Critics
A Critique On Code Critics
 
Anomaly detection in plain static graphs
Anomaly detection in plain static graphsAnomaly detection in plain static graphs
Anomaly detection in plain static graphs
 
Caim discretization algorithm
Caim discretization algorithmCaim discretization algorithm
Caim discretization algorithm
 
V1_I2_2012_Paper6.doc
V1_I2_2012_Paper6.docV1_I2_2012_Paper6.doc
V1_I2_2012_Paper6.doc
 
Survey of Data Mining Techniques on Crime Data Analysis
Survey of Data Mining Techniques on Crime Data AnalysisSurvey of Data Mining Techniques on Crime Data Analysis
Survey of Data Mining Techniques on Crime Data Analysis
 
Survey of Data Mining Techniques on Crime Data Analysis
Survey of Data Mining Techniques on Crime Data AnalysisSurvey of Data Mining Techniques on Crime Data Analysis
Survey of Data Mining Techniques on Crime Data Analysis
 
Analysis on different Data mining Techniques and algorithms used in IOT
Analysis on different Data mining Techniques and algorithms used in IOTAnalysis on different Data mining Techniques and algorithms used in IOT
Analysis on different Data mining Techniques and algorithms used in IOT
 
7 ijcse-01229
7 ijcse-012297 ijcse-01229
7 ijcse-01229
 
A Hybrid Theory Of Power Theft Detection
A Hybrid Theory Of Power Theft DetectionA Hybrid Theory Of Power Theft Detection
A Hybrid Theory Of Power Theft Detection
 
Threshold for Size and Complexity Metrics: A Case Study from the Perspective ...
Threshold for Size and Complexity Metrics: A Case Study from the Perspective ...Threshold for Size and Complexity Metrics: A Case Study from the Perspective ...
Threshold for Size and Complexity Metrics: A Case Study from the Perspective ...
 
Chapter 5.pdf
Chapter 5.pdfChapter 5.pdf
Chapter 5.pdf
 
A Review on Covid Detection using Cross Dataset Analysis
A Review on Covid Detection using Cross Dataset AnalysisA Review on Covid Detection using Cross Dataset Analysis
A Review on Covid Detection using Cross Dataset Analysis
 

More from Muhammad Imran

Processing Social Media Messages in Mass Emergency: A Survey
Processing Social Media Messages in Mass Emergency: A SurveyProcessing Social Media Messages in Mass Emergency: A Survey
Processing Social Media Messages in Mass Emergency: A SurveyMuhammad Imran
 
Damage Assessment from Social Media Imagery Data During Disasters
Damage Assessment from Social Media Imagery Data During DisastersDamage Assessment from Social Media Imagery Data During Disasters
Damage Assessment from Social Media Imagery Data During DisastersMuhammad Imran
 
Image4Act: Online Social Media Image Processing for Disaster Response
Image4Act: Online Social Media Image Processing for Disaster ResponseImage4Act: Online Social Media Image Processing for Disaster Response
Image4Act: Online Social Media Image Processing for Disaster ResponseMuhammad Imran
 
Real-Time Processing of Social Media Content for Social Good
Real-Time Processing of Social Media Content for Social GoodReal-Time Processing of Social Media Content for Social Good
Real-Time Processing of Social Media Content for Social GoodMuhammad Imran
 
AIDR Tutorial (Artificial Intelligence for Disaster Response)
AIDR Tutorial (Artificial Intelligence for Disaster Response)AIDR Tutorial (Artificial Intelligence for Disaster Response)
AIDR Tutorial (Artificial Intelligence for Disaster Response)Muhammad Imran
 
Summarizing Situational Tweets in Crisis Scenario
Summarizing Situational Tweets in Crisis ScenarioSummarizing Situational Tweets in Crisis Scenario
Summarizing Situational Tweets in Crisis ScenarioMuhammad Imran
 
Introduction to Machine Learning: An Application to Disaster Response
Introduction to Machine Learning: An Application to Disaster ResponseIntroduction to Machine Learning: An Application to Disaster Response
Introduction to Machine Learning: An Application to Disaster ResponseMuhammad Imran
 
Artificial Intelligence for Disaster Response
Artificial Intelligence for Disaster ResponseArtificial Intelligence for Disaster Response
Artificial Intelligence for Disaster ResponseMuhammad Imran
 
A Real-time Heuristic-based Unsupervised Method for Name Disambiguation in Di...
A Real-time Heuristic-based Unsupervised Method for Name Disambiguation in Di...A Real-time Heuristic-based Unsupervised Method for Name Disambiguation in Di...
A Real-time Heuristic-based Unsupervised Method for Name Disambiguation in Di...Muhammad Imran
 
Coordinating Human and Machine Intelligence to Classify Microblog Communica0o...
Coordinating Human and Machine Intelligence to Classify Microblog Communica0o...Coordinating Human and Machine Intelligence to Classify Microblog Communica0o...
Coordinating Human and Machine Intelligence to Classify Microblog Communica0o...Muhammad Imran
 
Tweet4act: Using Incident-Specific Profiles for Classifying Crisis-Related Me...
Tweet4act: Using Incident-Specific Profiles for Classifying Crisis-Related Me...Tweet4act: Using Incident-Specific Profiles for Classifying Crisis-Related Me...
Tweet4act: Using Incident-Specific Profiles for Classifying Crisis-Related Me...Muhammad Imran
 
Extracting Information Nuggets from Disaster-Related Messages in Social Media
Extracting Information Nuggets from Disaster-Related Messages in Social MediaExtracting Information Nuggets from Disaster-Related Messages in Social Media
Extracting Information Nuggets from Disaster-Related Messages in Social MediaMuhammad Imran
 
Domain Specific Mashups
Domain Specific MashupsDomain Specific Mashups
Domain Specific MashupsMuhammad Imran
 
Reseval Mashup Platform Talk at SECO
Reseval Mashup Platform Talk at SECOReseval Mashup Platform Talk at SECO
Reseval Mashup Platform Talk at SECOMuhammad Imran
 

More from Muhammad Imran (14)

Processing Social Media Messages in Mass Emergency: A Survey
Processing Social Media Messages in Mass Emergency: A SurveyProcessing Social Media Messages in Mass Emergency: A Survey
Processing Social Media Messages in Mass Emergency: A Survey
 
Damage Assessment from Social Media Imagery Data During Disasters
Damage Assessment from Social Media Imagery Data During DisastersDamage Assessment from Social Media Imagery Data During Disasters
Damage Assessment from Social Media Imagery Data During Disasters
 
Image4Act: Online Social Media Image Processing for Disaster Response
Image4Act: Online Social Media Image Processing for Disaster ResponseImage4Act: Online Social Media Image Processing for Disaster Response
Image4Act: Online Social Media Image Processing for Disaster Response
 
Real-Time Processing of Social Media Content for Social Good
Real-Time Processing of Social Media Content for Social GoodReal-Time Processing of Social Media Content for Social Good
Real-Time Processing of Social Media Content for Social Good
 
AIDR Tutorial (Artificial Intelligence for Disaster Response)
AIDR Tutorial (Artificial Intelligence for Disaster Response)AIDR Tutorial (Artificial Intelligence for Disaster Response)
AIDR Tutorial (Artificial Intelligence for Disaster Response)
 
Summarizing Situational Tweets in Crisis Scenario
Summarizing Situational Tweets in Crisis ScenarioSummarizing Situational Tweets in Crisis Scenario
Summarizing Situational Tweets in Crisis Scenario
 
Introduction to Machine Learning: An Application to Disaster Response
Introduction to Machine Learning: An Application to Disaster ResponseIntroduction to Machine Learning: An Application to Disaster Response
Introduction to Machine Learning: An Application to Disaster Response
 
Artificial Intelligence for Disaster Response
Artificial Intelligence for Disaster ResponseArtificial Intelligence for Disaster Response
Artificial Intelligence for Disaster Response
 
A Real-time Heuristic-based Unsupervised Method for Name Disambiguation in Di...
A Real-time Heuristic-based Unsupervised Method for Name Disambiguation in Di...A Real-time Heuristic-based Unsupervised Method for Name Disambiguation in Di...
A Real-time Heuristic-based Unsupervised Method for Name Disambiguation in Di...
 
Coordinating Human and Machine Intelligence to Classify Microblog Communica0o...
Coordinating Human and Machine Intelligence to Classify Microblog Communica0o...Coordinating Human and Machine Intelligence to Classify Microblog Communica0o...
Coordinating Human and Machine Intelligence to Classify Microblog Communica0o...
 
Tweet4act: Using Incident-Specific Profiles for Classifying Crisis-Related Me...
Tweet4act: Using Incident-Specific Profiles for Classifying Crisis-Related Me...Tweet4act: Using Incident-Specific Profiles for Classifying Crisis-Related Me...
Tweet4act: Using Incident-Specific Profiles for Classifying Crisis-Related Me...
 
Extracting Information Nuggets from Disaster-Related Messages in Social Media
Extracting Information Nuggets from Disaster-Related Messages in Social MediaExtracting Information Nuggets from Disaster-Related Messages in Social Media
Extracting Information Nuggets from Disaster-Related Messages in Social Media
 
Domain Specific Mashups
Domain Specific MashupsDomain Specific Mashups
Domain Specific Mashups
 
Reseval Mashup Platform Talk at SECO
Reseval Mashup Platform Talk at SECOReseval Mashup Platform Talk at SECO
Reseval Mashup Platform Talk at SECO
 

Recently uploaded

Pests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdfPests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdfPirithiRaju
 
Manassas R - Parkside Middle School 🌎🏫
Manassas R - Parkside Middle School 🌎🏫Manassas R - Parkside Middle School 🌎🏫
Manassas R - Parkside Middle School 🌎🏫qfactory1
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptxSTOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptxMurugaveni B
 
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.aasikanpl
 
Call Us ≽ 9953322196 ≼ Call Girls In Lajpat Nagar (Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Lajpat Nagar (Delhi) |Call Us ≽ 9953322196 ≼ Call Girls In Lajpat Nagar (Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Lajpat Nagar (Delhi) |aasikanpl
 
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPirithiRaju
 
Twin's paradox experiment is a meassurement of the extra dimensions.pptx
Twin's paradox experiment is a meassurement of the extra dimensions.pptxTwin's paradox experiment is a meassurement of the extra dimensions.pptx
Twin's paradox experiment is a meassurement of the extra dimensions.pptxEran Akiva Sinbar
 
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfBehavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfSELF-EXPLANATORY
 
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.aasikanpl
 
BUMI DAN ANTARIKSA PROJEK IPAS SMK KELAS X.pdf
BUMI DAN ANTARIKSA PROJEK IPAS SMK KELAS X.pdfBUMI DAN ANTARIKSA PROJEK IPAS SMK KELAS X.pdf
BUMI DAN ANTARIKSA PROJEK IPAS SMK KELAS X.pdfWildaNurAmalia2
 
Artificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PArtificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PPRINCE C P
 
The dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxThe dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxEran Akiva Sinbar
 
TOPIC 8 Temperature and Heat.pdf physics
TOPIC 8 Temperature and Heat.pdf physicsTOPIC 8 Temperature and Heat.pdf physics
TOPIC 8 Temperature and Heat.pdf physicsssuserddc89b
 
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)riyaescorts54
 
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxLIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxmalonesandreagweneth
 
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfAnalytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfSwapnil Therkar
 
Environmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial BiosensorEnvironmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial Biosensorsonawaneprad
 
Solution chemistry, Moral and Normal solutions
Solution chemistry, Moral and Normal solutionsSolution chemistry, Moral and Normal solutions
Solution chemistry, Moral and Normal solutionsHajira Mahmood
 

Recently uploaded (20)

Pests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdfPests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdf
 
Manassas R - Parkside Middle School 🌎🏫
Manassas R - Parkside Middle School 🌎🏫Manassas R - Parkside Middle School 🌎🏫
Manassas R - Parkside Middle School 🌎🏫
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
 
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptxSTOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
 
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
 
Call Us ≽ 9953322196 ≼ Call Girls In Lajpat Nagar (Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Lajpat Nagar (Delhi) |Call Us ≽ 9953322196 ≼ Call Girls In Lajpat Nagar (Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Lajpat Nagar (Delhi) |
 
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
 
Engler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomyEngler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomy
 
Twin's paradox experiment is a meassurement of the extra dimensions.pptx
Twin's paradox experiment is a meassurement of the extra dimensions.pptxTwin's paradox experiment is a meassurement of the extra dimensions.pptx
Twin's paradox experiment is a meassurement of the extra dimensions.pptx
 
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfBehavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
 
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
 
BUMI DAN ANTARIKSA PROJEK IPAS SMK KELAS X.pdf
BUMI DAN ANTARIKSA PROJEK IPAS SMK KELAS X.pdfBUMI DAN ANTARIKSA PROJEK IPAS SMK KELAS X.pdf
BUMI DAN ANTARIKSA PROJEK IPAS SMK KELAS X.pdf
 
Artificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PArtificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C P
 
The dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxThe dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptx
 
TOPIC 8 Temperature and Heat.pdf physics
TOPIC 8 Temperature and Heat.pdf physicsTOPIC 8 Temperature and Heat.pdf physics
TOPIC 8 Temperature and Heat.pdf physics
 
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
 
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxLIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
 
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfAnalytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
 
Environmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial BiosensorEnvironmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial Biosensor
 
Solution chemistry, Moral and Normal solutions
Solution chemistry, Moral and Normal solutionsSolution chemistry, Moral and Normal solutions
Solution chemistry, Moral and Normal solutions
 

A Robust Framework for Classifying Evolving Document Streams in an Expert-Machine-Crowd Setting

  • 1. A Robust Framework for Classifying Evolving Document Streams in an Expert-Machine-Crowd Setting Muhammad Imran*, Sanjay Chawla*, Carlos Castillo** *Qatar Computing Research Institute, Doha, Qatar **Eurecat, Barcelona, Spain
  • 2. Data Stream Processing Challenges 1. Infinite length 2. Concept-drift (change in data distributions) 3. Concept-evolution (new categories emerge) 4. Limited labeled data Credit Card fraud detection Sensor data classification Social media stream mining Data stream
  • 3. Social Media Stream Processing in Time-Critical Situations 2013 Pakistan Earthquake September 28 at 07:34 UTC 2010 Haiti Earthquake January 12 at 21:53 UTC Social Media Platforms Availability of Immense Data: Around 16 thousands tweets per minute were posted during the hurricane Sandy in the US. Opportunities: - Early warning and event detection - Situational awareness - Actionable information extraction - Rapid crisis response - Post-disaster analysis Disease outbreaks
  • 4. Social Media Data Streams Classification We address two issues in the classification (supervised) of social media streams: 1. How to keep the categories used for classification up-to-date? 1. While adding new categories, how to maintain high classification accuracy?
  • 5. Input and Output Category A Category B Category C Miscellaneous Z Category A’ Category B’ Category C’ Z1 Z2 Z’ INPUTOUTPUT
  • 6. Problem Definition Given as input a data set of documents: Categorized into a taxonomy: containing Partitioning of documents into taxonomy: Our task is to produce a new taxonomy: With the following characteristics: • There are N new categories: • Pre-existing categories are slightly modified: • New categories are different than the old: • The size of the miscellaneous category is reduced:
  • 7. Expert-Machine-Crowd Setting Constraints Outlier Detection (COD-Means): 1. Constraints formation using classified items 2. Clustering using COD-Means 3. Labeling errors identification (using outlier detection)
  • 8. Expert-Machine-Crowd Setting Constraints Outlier Detection (COD-Means): 1. Constraints formation using classified items 2. Clustering using COD-Means 3. Labeling errors identification (using outlier detection) 1 2 3 4
  • 9. Constraints Formation 1. Items in same category have Must-link constraints 2. Items belonging to different categories have Cannot-link constraints Category A Category B Category C Category Z Must-link Cannot-linkNote: Items in Z do not have any constraints
  • 10. Objective Function Standard distortion error If an ML constraint if violated then the cost of the violation is equal to the distance between the two centroids that contain the instances. If a CL constraint is violated then the error cost is the distance between the centroid C assigned to the pair and its nearest centroid h(c).
  • 11. Assignment and Update Rules Rule 1: For items without any constraints (standard distortion error) Rule 2: For items with Must-link constraints; cost of violation is distance b/w their centroids Rule 3: For items with Cannot-link constraints; cost is the distance b/w centroid c and Its nearest centroid is the Kronecker delta function i.e. it is 1 if x=y and 0 if x != y Update rule: The update rule computes a modified average of all points that belong to a cluster.
  • 12. COD-Means Algorithm Algorithm 1 2 3 Initialization (e.g. random pick of k centroids) Assignment of items based on 3 assignment rules considering ML and CL constraints Points in each cluster are sorted based on their distance to the centroid and top l are removed and inserted into L
  • 13. Dataset and Experiments 1. Are the new clusters identified by the COD-Means algorithm genuinely different and novel? 2. What is the nature of outliers (labeling errors) discovered by the COD-Means algorithm? Are they genuine outliers? 3. What is the impact of outlier on the quality of clusters generated by COD-Means? 4. Once refined clusters (without labeling errors) used in the training process, does the overall accuracy improves? 8 disaster-related datasets were used from Twitter
  • 14. Clusters Novelty and Coherence K-Means vs. COD-Means • The proposed approach generates more cohesive and novel clusters by removing outliers. • As the value of L increases, more tight and coherent clusters are observed.
  • 15. Data Improvements Evaluation 1. Labeling errors in non-miscellaneous categories 2. Items incorrectly labeled as miscellaneous
  • 17. Conclusion • Our setting: supervised stream classification • We presented COD-Means to learn novel categories and labeling errors from live streams • We used real-word Twitter datasets and performed extensive experimentation • We showed that COD-Means is able to identify new categories and labeling errors efficiently
  • 18. Thank you for your attention!

Editor's Notes

  1. Data streams processing such as credit card fraud detection, sensor data classification, social media stream mining has a number of challenges such as infinite length, concept drift, concept evolution and limited labeled data
  2. In this work, we focus on social media stream classification. This is important during all types of disasters and emergencies. Our interest is to classify millions of messages that people post on social media after disasters.
  3. So, our setting is supervised classification of social media streams. We have humans who provide labels
  4. This show what is our input and what we expect as output.
  5. The COD-Means problem is NP-hard for k > 1 and L ≥ 0.
  6. - The update rule of cj computes a modified average of all points that belong to a cluster . The modification captures the number of elements in the cluster which violated the ML and CL constraints.
  7. Cohesiveness: intra/inter: intra: (avg. distance of elements inside a cluster) inter (avg. distance w.r.t to other clusters) Novelty: maxDist(Ci) − minDist(Ci) Where: maxDist(Ci) = max a∈Ci,b∈∪k i=1Ti d(a, b) and minDist(Ci) = min a∈Ci,b∈∪k i=1Ti d(a, b) .