A Robust Framework for Classifying Evolving Document Streams in an Expert-Machine-Crowd Setting

Muhammad Imran*, Sanjay Chawla*, Carlos Castillo**
*Qatar Computing Research Institute, Doha, Qatar
**Eurecat, Barcelona, Spain

Data Stream Processing
Challenges
1. Infinite length
2. Concept-drift (change in data distributions)
3. Concept-evolution (new categories emerge)
4. Limited labeled data
Data stream examples: credit card fraud detection, sensor data classification, social media stream mining

Social Media Stream Processing in
Time-Critical Situations
2013 Pakistan Earthquake
September 28 at 07:34 UTC
2010 Haiti Earthquake
January 12 at 21:53 UTC
Social Media
Platforms
Availability of Immense Data:
Around 16 thousand tweets
per minute were posted during
Hurricane Sandy in the US.
Opportunities:
- Early warning and event detection
- Situational awareness
- Actionable information extraction
- Rapid crisis response
- Post-disaster analysis
Disease outbreaks

Social Media Data Streams
Classification
We address two issues in the classification (supervised) of
social media streams:
1. How to keep the categories used for classification up to date?
2. While adding new categories, how to maintain high
classification accuracy?

Input and Output
Input: Category A, Category B, Category C, Miscellaneous Z
Output: Category A’, Category B’, Category C’, new categories Z1 and Z2, reduced miscellaneous Z’

Problem Definition
Given as input a data set of documents:
Categorized into a taxonomy: containing
Partitioning of documents into taxonomy:
Our task is to produce a new taxonomy:
With the following characteristics:
• There are N new categories:
• Pre-existing categories are slightly modified:
• New categories are different than the old:
• The size of the miscellaneous category is reduced:

Expert-Machine-Crowd Setting
Constraints Outlier Detection (COD-Means):
1. Constraints formation using classified items
2. Clustering using COD-Means
3. Labeling errors identification (using outlier detection)

Constraints Formation
1. Items in same category have Must-link constraints
2. Items belonging to different categories have Cannot-link
constraints
Figure: Category A, Category B, Category C, and Category Z, with Must-link and Cannot-link constraints drawn between items.
Note: Items in Z do not have any constraints.
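The two constraint rules above can be illustrated with a minimal sketch. It assumes each item carries a single category label and that the miscellaneous category is named "Z"; the function name `build_constraints` and the pair representation are hypothetical, not taken from the slides.

```python
from itertools import combinations

def build_constraints(labels, misc="Z"):
    """Return (must_link, cannot_link) as sets of index pairs (i, j), i < j.

    labels: one category name per item. Items labeled `misc` are skipped,
    so they take part in no constraint (assumption: misc is named "Z").
    """
    must_link, cannot_link = set(), set()
    constrained = [i for i, lab in enumerate(labels) if lab != misc]
    for i, j in combinations(constrained, 2):
        if labels[i] == labels[j]:
            must_link.add((i, j))      # same category: Must-link
        else:
            cannot_link.add((i, j))    # different categories: Cannot-link
    return must_link, cannot_link

# Toy example: two items in A, one in B, one miscellaneous, one in C
labels = ["A", "A", "B", "Z", "C"]
ml, cl = build_constraints(labels)
```

Enumerating all pairs is quadratic in the number of labeled items, so on a real stream one would likely sample pairs instead; that choice is outside what the slides state.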

Objective Function
Standard distortion error
If an ML constraint is violated, the cost of the violation is equal to the distance between the centroids of the two clusters that contain the instances.
If a CL constraint is violated, the error cost is the distance between the centroid c assigned to the pair and its nearest centroid h(c).
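The formula itself did not survive in this transcript. One plausible way to write an objective that matches the three verbal terms is sketched below. The notation is an assumption: y_i is the cluster of item x_i, c_j is the centroid of cluster j, and h(j) is the index of the centroid nearest to c_j. Squared Euclidean distances are also an assumption, chosen to match the usual distortion error. The handling of the removed outlier set L is not modeled here.

```latex
\[
E \;=\; \frac{1}{2}\sum_{j=1}^{k}\Bigg[
   \sum_{i:\,y_i = j} \lVert x_i - c_j \rVert^2
   \;+\; \sum_{\substack{(x_i, x_a)\in ML \\ y_i = j,\; y_a \neq j}} \lVert c_j - c_{y_a} \rVert^2
   \;+\; \sum_{\substack{(x_i, x_a)\in CL \\ y_i = y_a = j}} \lVert c_j - c_{h(j)} \rVert^2
\Bigg]
\]
```

The first sum is the standard distortion error, the second charges each violated Must-link constraint, and the third charges each violated Cannot-link constraint.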

Assignment and Update Rules
Rule 1: For items without any constraints, the cost is the standard distortion error.
Rule 2: For items with Must-link constraints, the cost of a violation is the distance between their centroids.
Rule 3: For items with Cannot-link constraints, the cost of a violation is the distance between centroid c and its nearest centroid h(c).
δ is the Kronecker delta function, i.e., δ(x, y) is 1 if x = y and 0 if x ≠ y.
Update rule: The update rule computes a modified average of all points that belong to a cluster.
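As a rough illustration of the three assignment rules, the sketch below computes the cost of placing one item in one candidate cluster. The function name `assignment_cost`, the use of squared Euclidean distances, the 1/2 weights, and the convention that -1 marks an unassigned item are all assumptions made for this example, not details confirmed by the slides.

```python
import numpy as np

def assignment_cost(i, j, X, centroids, assign, must_link, cannot_link):
    """Cost of placing item i in cluster j, given the other items' clusters.

    assign[p] is the current cluster of item p, or -1 if not yet assigned.
    Needs at least two centroids so that a nearest other centroid exists.
    """
    # Rule 1: standard distortion term
    cost = 0.5 * np.sum((X[i] - centroids[j]) ** 2)
    # Squared gap from centroid j to every centroid, excluding j itself
    gaps = np.sum((centroids - centroids[j]) ** 2, axis=1)
    gaps[j] = np.inf
    nearest_gap = gaps.min()
    # Rule 2: Must-link partner sits in a different cluster
    for a, b in must_link:
        if i in (a, b):
            other = assign[b if a == i else a]
            if other >= 0 and other != j:
                cost += 0.5 * np.sum((centroids[j] - centroids[other]) ** 2)
    # Rule 3: Cannot-link partner sits in the same cluster
    for a, b in cannot_link:
        if i in (a, b):
            other = assign[b if a == i else a]
            if other == j:
                cost += 0.5 * nearest_gap
    return float(cost)

X = np.array([[0.0, 0.0], [4.0, 0.0]])
centroids = X.copy()
assign = [-1, 1]  # item 1 already sits in cluster 1
cost_free = assignment_cost(0, 0, X, centroids, assign, set(), set())
cost_ml = assignment_cost(0, 0, X, centroids, assign, {(0, 1)}, set())
cost_cl = assignment_cost(0, 1, X, centroids, assign, set(), {(0, 1)})
```

An item would then be assigned to the cluster j with the lowest cost. In the toy example, separating the Must-link pair costs 8.0, while forcing the Cannot-link pair together costs 16.0.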

COD-Means Algorithm
Algorithm
1. Initialization (e.g., random pick of k centroids)
2. Assignment of items based on the 3 assignment rules, considering ML and CL constraints
3. Points in each cluster are sorted by their distance to the centroid; the top l are removed and inserted into L
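The outlier-removal step can be sketched as follows. This is a simplified illustration, not the full algorithm: it assumes "top l" means the l points farthest from their centroid in each cluster, and it replaces the constraint-aware update rule with a plain mean over the remaining points. The function name `remove_outliers_and_update` is hypothetical.

```python
import numpy as np

def remove_outliers_and_update(X, centroids, assign, l):
    """Drop the l farthest points per cluster, then recompute centroids.

    Returns (outliers, new_centroids), where outliers is the sorted list
    of removed item indices (the set L in the slides).
    """
    assign = np.asarray(assign)
    # Distance of every point to its own cluster centroid
    dist = np.linalg.norm(X - centroids[assign], axis=1)
    outliers = []
    new_centroids = centroids.copy()
    for j in range(len(centroids)):
        members = np.flatnonzero(assign == j)
        # Sort cluster members from farthest to nearest
        order = members[np.argsort(dist[members])[::-1]]
        outliers.extend(order[:l].tolist())
        kept = order[l:]
        if len(kept) > 0:
            # Simplification: plain mean instead of the modified average
            new_centroids[j] = X[kept].mean(axis=0)
    return sorted(outliers), new_centroids

X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [20.0, 0.0], [21.0, 0.0]])
centroids = np.array([[0.0, 0.0], [20.0, 0.0]])
assign = [0, 0, 0, 1, 1]
L_set, updated = remove_outliers_and_update(X, centroids, assign, l=1)
```

In a full loop, this step would alternate with the constrained assignment step until the assignments stop changing. The removed items in L are the candidates for labeling errors.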

Dataset and Experiments
1. Are the new clusters identified by the COD-Means algorithm genuinely different and
novel?
2. What is the nature of outliers (labeling errors) discovered by the COD-Means
algorithm? Are they genuine outliers?
3. What is the impact of outliers on the quality of clusters generated by COD-Means?
4. Once refined clusters (without labeling errors) are used in the training process, does the
overall accuracy improve?
Eight disaster-related Twitter datasets were used.

Clusters Novelty and Coherence
K-Means vs. COD-Means
• The proposed approach generates more cohesive and novel clusters by removing outliers.
• As the value of L increases, tighter and more coherent clusters are observed.

Data Improvements Evaluation
1. Labeling errors in non-miscellaneous categories
2. Items incorrectly labeled as miscellaneous

Impact on Classification Performance

Conclusion
• Our setting: supervised stream classification
• We presented COD-Means to learn novel
categories and labeling errors from live streams
• We used real-world Twitter datasets and
performed extensive experimentation
• We showed that COD-Means is able to identify
new categories and labeling errors efficiently
