SlideShare a Scribd company logo
1 of 21
Download to read offline
Making the Most of Tweet-Inherent Features for
Social Spam Detection on Twitter
Bo Wang, Arkaitz Zubiaga, Maria Liakata and Rob Procter
Department of Computer Science
University of Warwick
18th May 2015
Social Spam on Twitter
Motivation
• Social spam is an important issue in social media services
such as Twitter, e.g.:
• Users inject tweets in trending topics.
• Users reply with promotional messages providing a link.
• We want to be able to identify these spam tweets in a
Twitter stream.
Social Spam on Twitter
How Did we Feel the Need to Identify Spam?
• We started tracking events via streaming API.
• They were often riddled with noisy tweets.
Social Spam on Twitter
Example
Social Spam on Twitter
Our Approach
• Detection of spammers: unsuitable, we couldn’t
aggregate a user’s data from a stream.
• Alternative solution: Determine if tweet is spam from its
inherent features.
Social Spam on Twitter
Definitions
• Spam originally coined for unsolicited email.
• How to define spam for Twitter? (not easy!)
• Twitter has own definition of spam, where certain level of
advertisements is allowed:
• It rather refers to the user level rather than tweet level, e.g.,
users who massively follow others.
• Harder to define a spam than a spammer.
Social Spam on Twitter
Our Definition
• Twitter spam: noisy content produced by users who
express a different behaviour from what the system is
intended for, and has the goal of grabbing attention by
exploiting the social media service’s characteristics.
Spammer vs. Spam Detection
What Did Others Do?
• Most previous work focused on spammer detection (users).
• They used features which are not readily available in a
tweet:
• For example, historical user behaviour and network
features.
• Not feasible for our use.
Spammer vs. Spam Detection
What Do We Want To Do Instead?
• (Near) Real-time spam detection, limited to features
readily available in a stream of tweets.
• Contributions:
• Test on two existing datasets, adapted to our purposes.
• Definition of different feature sets.
• Compare different classification algorithms.
• Investigate the use of different tweet-inherent features.
Datasets
• We relied on two (spammer vs non-spammer) datasets:
• Social Honeypot (Lee et al., 2011 [1]): used social honeypots
to attract spammers.
• 1KS-10KN (Yang et al., 2011 [2]): harvested tweets
containing certain malicious URLs.
• Spammer dataset to our spam dataset: Randomly select
one tweet from each spammer or legitimate user.
• Social Honeypot: 20,707 spam vs 19,249 non-spam (∼1:1).
• 1KS-10KN: 1,000 spam vs 9,828 non-spam (∼1:10).
Feature Engineering
User features Content features
Length of profile name Number of words
Length of profile description Number of characters
Number of followings (FI) Number of white spaces
Number of followers (FE) Number of capitalization words
Number of tweets posted Number of capitalization words per word
Age of the user account, in hours (AU) Maximum word length
Ratio of number of followings and followers (FE/FI) Mean word length
Reputation of the user (FE/(FI + FE)) Number of exclamation marks
Following rate (FI/AU) Number of question marks
Number of tweets posted per day Number of URL links
Number of tweets posted per week Number of URL links per word
N-grams Number of hashtags
Uni + bi-gram or bi + tri-gram Number of hashtags per word
Number of mentions
Sentiment features Number of mentions per word
Automatically created sentiment lexicons Number of spam words
Manually created sentiment lexicons Number of spam words per word
Part of speech tags of every tweet
Evaluation
Experiment Settings
• 5 widely-used classification algorithms: Bernoulli Naive
Bayes, KNN, SVM, Decision Tree and Random Forests.
• Hyperparameters optimised from a subset of the dataset
separate from train/test sets.
• All 4 feature sets were combined.
• 10-fold cross-validation.
Evaluation
Selection of Classifier
Classifier
1KS-10KN Dataset Social Honeypot Dataset
Precision Recall F-measure Precision Recall F1-measure
Bernoulli NB 0.899 0.688 0.778 0.772 0.806 0.789
KNN 0.924 0.706 0.798 0.802 0.778 0.790
SVM 0.872 0.708 0.780 0.844 0.817 0.830
Decision Tree 0.788 0.782 0.784 0.914 0.916 0.915
Random Forest 0.993 0.716 0.831 0.941 0.950 0.946
• Random Forests outperform others in terms of
F1-measure and Precision.
• Better performance on Social Honeypot (1:1 ratio rather
than 1:10?).
• Results only 4% below original papers, which require
historic user features.
Evaluation
Evaluation of Features (w/ Random Forests)
Feature Set
1KS-10KN Dataset Social Honeypot Dataset
Precision Recall F-measure Precision Recall F-measure
User features (U) 0.895 0.709 0.791 0.938 0.940 0.940
Content features (C) 0.951 0.657 0.776 0.771 0.753 0.762
Uni + Bi-gram (Binary) 0.930 0.725 0.815 0.759 0.727 0.743
Uni + Bi-gram (Tf) 0.959 0.715 0.819 0.783 0.767 0.775
Uni + Bi-gram (Tfidf) 0.943 0.726 0.820 0.784 0.765 0.775
Bi + Tri-gram (Tfidf) 0.931 0.684 0.788 0.797 0.656 0.720
Sentiment features (S) 0.966 0.574 0.718 0.679 0.727 0.702
• Testing feature sets one by one:
• User features (U) most determinant for Social Honeypot.
• N-gram features best for 1KS-10KN.
• Potentially due to diff. dataset generation approaches?
Evaluation
Evaluation of Features (w/ Random Forests)
Feature Set
1KS-10KN Dataset Social Honeypot Dataset
Precision Recall F-measure Precision Recall F-measure
Single feature set 0.943 0.726 0.820 0.938 0.940 0.940
U + C 0.974 0.708 0.819 0.938 0.949 0.943
U + Bi & Tri-gram (Tf) 0.972 0.745 0.843 0.937 0.949 0.943
U + S 0.948 0.732 0.825 0.940 0.944 0.942
Uni & Bi-gram (Tf) + S 0.964 0.721 0.824 0.797 0.744 0.770
C + S 0.970 0.649 0.777 0.778 0.762 0.770
C + Uni & Bi-gram (Tf) 0.968 0.717 0.823 0.783 0.757 0.770
U + C + Uni & Bi-gram (Tf) 0.985 0.727 0.835 0.934 0.949 0.941
U + C + S 0.982 0.704 0.819 0.937 0.948 0.942
U + Uni & Bi-gram (Tf) + S 0.994 0.720 0.834 0.928 0.946 0.937
C + Uni & Bi-gram (Tf) + S 0.966 0.720 0.824 0.806 0.758 0.782
U + C + Uni & Bi-gram (Tf) + S 0.988 0.725 0.835 0.936 0.947 0.942
• However, when we combine feature sets:
• The same approach performs best (F1) for both: U + Bi &
Tri-gram (Tf).
• Combining features helps us capture diff. types of spam
tweets.
Evaluation
Computational Efficiency
• Beyond accuracy, how can all these features be applied
efficiently in a stream?
Evaluation
Computational Efficiency
Feature set
Comp. time (seconds)
for 1k tweets
User features 0.0057
N-gram 0.3965
Sentiment features 20.9838
Number of spam words (NSW) 19.0111
Part-of-speech counts (POS) 0.6139
Content features including NSW and POS 20.2367
Content features without NSW 1.0448
Content features without POS 19.6165
• Tested on regular computer (2.8 GHz Intel Core i7 processor
and 16 GB memory).
• The features that performed best in combination (User
and N-grams) are those most efficiently calculated.
Conclusion
• Random Forests were found to be the most accurate
classifier.
• Comparable performance to previous work (-4%) while
limiting features to those in a tweet.
• The use of multiple feature sets increases the possibility
to capture different spam types, and makes it more
difficult for spammers to evade.
• Diff. features perform better when used separately, but
same features are useful when combined.
Future Work
• Spam corpus constructed by picking tweets from
spammers.
• Need to study if legitimate users also likely to post spam
tweets, and how it could affect the results.
• A more recent, manually labelled spam/non-spam
dataset.
• Feasibility of cross-dataset spam classification?
That’s it!
• Any Questions?
K. Lee, B. D. Eoff, and J. Caverlee.
Seven months with the devils: A long-term study of content
polluters on twitter.
In L. A. Adamic, R. A. Baeza-Yates, and S. Counts, editors,
ICWSM. The AAAI Press, 2011.
C. Yang, R. C. Harkreader, and G. Gu.
Die free or live hard? empirical evaluation and new design for
fighting evolving twitter spammers.
In Proceedings of the 14th International Conference on Recent
Advances in Intrusion Detection, RAID’11, pages 318–337,
Berlin, Heidelberg, 2011. Springer-Verlag.

More Related Content

What's hot

Credit card fraud detection
Credit card fraud detectionCredit card fraud detection
Credit card fraud detectionvineeta vineeta
 
Twitter sentimentanalysis report
Twitter sentimentanalysis reportTwitter sentimentanalysis report
Twitter sentimentanalysis reportSavio Aberneithie
 
Fake news detection project
Fake news detection projectFake news detection project
Fake news detection projectHarshdaGhai
 
Sentiment analysis of Twitter Data
Sentiment analysis of Twitter DataSentiment analysis of Twitter Data
Sentiment analysis of Twitter DataNurendra Choudhary
 
Rsa and diffie hellman algorithms
Rsa and diffie hellman algorithmsRsa and diffie hellman algorithms
Rsa and diffie hellman algorithmsdaxesh chauhan
 
Credit card fraud detection using machine learning Algorithms
Credit card fraud detection using machine learning AlgorithmsCredit card fraud detection using machine learning Algorithms
Credit card fraud detection using machine learning Algorithmsankit panigrahy
 
Credit card fraud detection through machine learning
Credit card fraud detection through machine learningCredit card fraud detection through machine learning
Credit card fraud detection through machine learningdataalcott
 
Sentiment Analysis on Twitter
Sentiment Analysis on TwitterSentiment Analysis on Twitter
Sentiment Analysis on TwitterSubarno Pal
 
Presentation-Detecting Spammers on Social Networks
Presentation-Detecting Spammers on Social NetworksPresentation-Detecting Spammers on Social Networks
Presentation-Detecting Spammers on Social NetworksAshish Arora
 
Sentiment analysis using ml
Sentiment analysis using mlSentiment analysis using ml
Sentiment analysis using mlPravin Katiyar
 
Sentiment Analysis Using Machine Learning
Sentiment Analysis Using Machine LearningSentiment Analysis Using Machine Learning
Sentiment Analysis Using Machine LearningNihar Suryawanshi
 
Detecting Fake News Through NLP
Detecting Fake News Through NLPDetecting Fake News Through NLP
Detecting Fake News Through NLPSakha Global
 
Presentation on Sentiment Analysis
Presentation on Sentiment AnalysisPresentation on Sentiment Analysis
Presentation on Sentiment AnalysisRebecca Williams
 
Text classification & sentiment analysis
Text classification & sentiment analysisText classification & sentiment analysis
Text classification & sentiment analysisM. Atif Qureshi
 
Credit card fraud detection
Credit card fraud detectionCredit card fraud detection
Credit card fraud detectionkalpesh1908
 
Seminar on detecting fake accounts in social media using machine learning
Seminar on detecting fake accounts in social media using machine learningSeminar on detecting fake accounts in social media using machine learning
Seminar on detecting fake accounts in social media using machine learningParvathi Sanil Nair
 
Application of Machine Learning in Cybersecurity
Application of Machine Learning in CybersecurityApplication of Machine Learning in Cybersecurity
Application of Machine Learning in CybersecurityPratap Dangeti
 

What's hot (20)

Credit card fraud detection
Credit card fraud detectionCredit card fraud detection
Credit card fraud detection
 
Twitter sentimentanalysis report
Twitter sentimentanalysis reportTwitter sentimentanalysis report
Twitter sentimentanalysis report
 
Final Report(SuddhasatwaSatpathy)
Final Report(SuddhasatwaSatpathy)Final Report(SuddhasatwaSatpathy)
Final Report(SuddhasatwaSatpathy)
 
Fake news detection project
Fake news detection projectFake news detection project
Fake news detection project
 
Sentiment analysis of Twitter Data
Sentiment analysis of Twitter DataSentiment analysis of Twitter Data
Sentiment analysis of Twitter Data
 
Rsa and diffie hellman algorithms
Rsa and diffie hellman algorithmsRsa and diffie hellman algorithms
Rsa and diffie hellman algorithms
 
Credit card fraud detection using machine learning Algorithms
Credit card fraud detection using machine learning AlgorithmsCredit card fraud detection using machine learning Algorithms
Credit card fraud detection using machine learning Algorithms
 
Spam Email identification
Spam Email identificationSpam Email identification
Spam Email identification
 
Credit card fraud detection through machine learning
Credit card fraud detection through machine learningCredit card fraud detection through machine learning
Credit card fraud detection through machine learning
 
Sentiment Analysis on Twitter
Sentiment Analysis on TwitterSentiment Analysis on Twitter
Sentiment Analysis on Twitter
 
Presentation-Detecting Spammers on Social Networks
Presentation-Detecting Spammers on Social NetworksPresentation-Detecting Spammers on Social Networks
Presentation-Detecting Spammers on Social Networks
 
Sentiment analysis using ml
Sentiment analysis using mlSentiment analysis using ml
Sentiment analysis using ml
 
Sentiment Analysis Using Machine Learning
Sentiment Analysis Using Machine LearningSentiment Analysis Using Machine Learning
Sentiment Analysis Using Machine Learning
 
Captcha ppt
Captcha pptCaptcha ppt
Captcha ppt
 
Detecting Fake News Through NLP
Detecting Fake News Through NLPDetecting Fake News Through NLP
Detecting Fake News Through NLP
 
Presentation on Sentiment Analysis
Presentation on Sentiment AnalysisPresentation on Sentiment Analysis
Presentation on Sentiment Analysis
 
Text classification & sentiment analysis
Text classification & sentiment analysisText classification & sentiment analysis
Text classification & sentiment analysis
 
Credit card fraud detection
Credit card fraud detectionCredit card fraud detection
Credit card fraud detection
 
Seminar on detecting fake accounts in social media using machine learning
Seminar on detecting fake accounts in social media using machine learningSeminar on detecting fake accounts in social media using machine learning
Seminar on detecting fake accounts in social media using machine learning
 
Application of Machine Learning in Cybersecurity
Application of Machine Learning in CybersecurityApplication of Machine Learning in Cybersecurity
Application of Machine Learning in Cybersecurity
 

Viewers also liked

Detecting Spammers on Social Networks
Detecting Spammers on Social NetworksDetecting Spammers on Social Networks
Detecting Spammers on Social NetworksGianluca Stringhini
 
Twitter Content-based Spam Filtering - CISIS 2013
Twitter Content-based Spam Filtering - CISIS 2013Twitter Content-based Spam Filtering - CISIS 2013
Twitter Content-based Spam Filtering - CISIS 2013Carlos Laorden
 
12 ways trending twitter topics and hashtags may not be working for you
12 ways trending twitter topics and hashtags may not be working for you12 ways trending twitter topics and hashtags may not be working for you
12 ways trending twitter topics and hashtags may not be working for youOnline Promotion Success, Inc.
 
Graph-based KNN Algorithm for Spam SMS Detection
Graph-based KNN Algorithm for Spam SMS DetectionGraph-based KNN Algorithm for Spam SMS Detection
Graph-based KNN Algorithm for Spam SMS DetectionSOYEON KIM
 
Spamming and Spam Filtering
Spamming and Spam FilteringSpamming and Spam Filtering
Spamming and Spam FilteringiNazneen
 
E Mail & Spam Presentation
E Mail & Spam PresentationE Mail & Spam Presentation
E Mail & Spam Presentationnewsan2001
 
Enhancing Twitter spam discovery using cross account pattern matching.
Enhancing Twitter spam discovery using cross account pattern matching.Enhancing Twitter spam discovery using cross account pattern matching.
Enhancing Twitter spam discovery using cross account pattern matching.Ambarish Pande
 

Viewers also liked (11)

Detecting Spammers on Social Networks
Detecting Spammers on Social NetworksDetecting Spammers on Social Networks
Detecting Spammers on Social Networks
 
Spam, security
Spam, securitySpam, security
Spam, security
 
Twitter Spam
Twitter SpamTwitter Spam
Twitter Spam
 
Twitter Content-based Spam Filtering - CISIS 2013
Twitter Content-based Spam Filtering - CISIS 2013Twitter Content-based Spam Filtering - CISIS 2013
Twitter Content-based Spam Filtering - CISIS 2013
 
12 ways trending twitter topics and hashtags may not be working for you
12 ways trending twitter topics and hashtags may not be working for you12 ways trending twitter topics and hashtags may not be working for you
12 ways trending twitter topics and hashtags may not be working for you
 
Graph-based KNN Algorithm for Spam SMS Detection
Graph-based KNN Algorithm for Spam SMS DetectionGraph-based KNN Algorithm for Spam SMS Detection
Graph-based KNN Algorithm for Spam SMS Detection
 
Bulk sms
Bulk smsBulk sms
Bulk sms
 
Spamming and Spam Filtering
Spamming and Spam FilteringSpamming and Spam Filtering
Spamming and Spam Filtering
 
E Mail & Spam Presentation
E Mail & Spam PresentationE Mail & Spam Presentation
E Mail & Spam Presentation
 
Enhancing Twitter spam discovery using cross account pattern matching.
Enhancing Twitter spam discovery using cross account pattern matching.Enhancing Twitter spam discovery using cross account pattern matching.
Enhancing Twitter spam discovery using cross account pattern matching.
 
Spam
SpamSpam
Spam
 

Similar to Microposts2015 - Social Spam Detection on Twitter

SplunkLive! New York Dec 2012 - SNAP Interactive
SplunkLive! New York Dec 2012 - SNAP InteractiveSplunkLive! New York Dec 2012 - SNAP Interactive
SplunkLive! New York Dec 2012 - SNAP InteractiveSplunk
 
Semantic Analysis to Compute Personality Traits from Social Media Posts
Semantic Analysis to Compute Personality Traits from Social Media PostsSemantic Analysis to Compute Personality Traits from Social Media Posts
Semantic Analysis to Compute Personality Traits from Social Media PostsGiulio Carducci
 
18.02.05_IAAI2018_Mobille Network Failure Event Detection and Forecasting wit...
18.02.05_IAAI2018_Mobille Network Failure Event Detection and Forecasting wit...18.02.05_IAAI2018_Mobille Network Failure Event Detection and Forecasting wit...
18.02.05_IAAI2018_Mobille Network Failure Event Detection and Forecasting wit...LINE Corp.
 
Scalable Topic-Specific Influence Analysis on Microblogs
Scalable Topic-Specific Influence Analysis on MicroblogsScalable Topic-Specific Influence Analysis on Microblogs
Scalable Topic-Specific Influence Analysis on MicroblogsYuanyuan Tian
 
A flexible recommenndation system for Cable TV
A flexible recommenndation system for Cable TVA flexible recommenndation system for Cable TV
A flexible recommenndation system for Cable TVIntoTheMinds
 
A Flexible Recommendation System for Cable TV
A Flexible Recommendation System for Cable TVA Flexible Recommendation System for Cable TV
A Flexible Recommendation System for Cable TVFrancisco Couto
 
Feature Based Opinion Mining from Amazon Reviews
Feature Based Opinion Mining from Amazon ReviewsFeature Based Opinion Mining from Amazon Reviews
Feature Based Opinion Mining from Amazon ReviewsRavi Kiran Holur Vijay
 
Recommendation engine Using Genetic Algorithm
Recommendation engine Using Genetic AlgorithmRecommendation engine Using Genetic Algorithm
Recommendation engine Using Genetic AlgorithmVaibhav Varshney
 
Building Continuous Learning Systems
Building Continuous Learning SystemsBuilding Continuous Learning Systems
Building Continuous Learning SystemsAnuj Gupta
 
DeepScan: Exploiting Deep Learning for Malicious Account Detection in Locatio...
DeepScan: Exploiting Deep Learning for Malicious Account Detection in Locatio...DeepScan: Exploiting Deep Learning for Malicious Account Detection in Locatio...
DeepScan: Exploiting Deep Learning for Malicious Account Detection in Locatio...yeung2000
 
Recommendation Systems
Recommendation SystemsRecommendation Systems
Recommendation SystemsRobin Reni
 
Who will RT this?: Automatically Identifying and Engaging Strangers on Twitte...
Who will RT this?: Automatically Identifying and Engaging Strangers on Twitte...Who will RT this?: Automatically Identifying and Engaging Strangers on Twitte...
Who will RT this?: Automatically Identifying and Engaging Strangers on Twitte...Jeffrey Nichols
 
Towards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology ApplicationsTowards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology ApplicationsMarina Santini
 
A Two Step Ranking Solution for Twitter User Engagement
A Two Step Ranking Solution for Twitter User Engagement�A Two Step Ranking Solution for Twitter User Engagement�
A Two Step Ranking Solution for Twitter User EngagementBehnoush Abdollahi
 
Real-time Classification of Malicious URLs on Twitter using Machine Activity ...
Real-time Classification of Malicious URLs on Twitter using Machine Activity ...Real-time Classification of Malicious URLs on Twitter using Machine Activity ...
Real-time Classification of Malicious URLs on Twitter using Machine Activity ...Pete Burnap
 
Building High Available and Scalable Machine Learning Applications
Building High Available and Scalable Machine Learning ApplicationsBuilding High Available and Scalable Machine Learning Applications
Building High Available and Scalable Machine Learning ApplicationsYalçın Yenigün
 
Software Analytics: Data Analytics for Software Engineering
Software Analytics: Data Analytics for Software EngineeringSoftware Analytics: Data Analytics for Software Engineering
Software Analytics: Data Analytics for Software EngineeringTao Xie
 
AWS re:Invent 2016: Getting to Ground Truth with Amazon Mechanical Turk (MAC201)
AWS re:Invent 2016: Getting to Ground Truth with Amazon Mechanical Turk (MAC201)AWS re:Invent 2016: Getting to Ground Truth with Amazon Mechanical Turk (MAC201)
AWS re:Invent 2016: Getting to Ground Truth with Amazon Mechanical Turk (MAC201)Amazon Web Services
 
ML&AI APPROACH TO USER UNDERSTANDING ECOSYSTEM AT VCCORP Applications to News...
ML&AI APPROACH TO USER UNDERSTANDING ECOSYSTEM AT VCCORP Applications to News...ML&AI APPROACH TO USER UNDERSTANDING ECOSYSTEM AT VCCORP Applications to News...
ML&AI APPROACH TO USER UNDERSTANDING ECOSYSTEM AT VCCORP Applications to News...Tuan Hoang
 

Similar to Microposts2015 - Social Spam Detection on Twitter (20)

SplunkLive! New York Dec 2012 - SNAP Interactive
SplunkLive! New York Dec 2012 - SNAP InteractiveSplunkLive! New York Dec 2012 - SNAP Interactive
SplunkLive! New York Dec 2012 - SNAP Interactive
 
Semantic Analysis to Compute Personality Traits from Social Media Posts
Semantic Analysis to Compute Personality Traits from Social Media PostsSemantic Analysis to Compute Personality Traits from Social Media Posts
Semantic Analysis to Compute Personality Traits from Social Media Posts
 
18.02.05_IAAI2018_Mobille Network Failure Event Detection and Forecasting wit...
18.02.05_IAAI2018_Mobille Network Failure Event Detection and Forecasting wit...18.02.05_IAAI2018_Mobille Network Failure Event Detection and Forecasting wit...
18.02.05_IAAI2018_Mobille Network Failure Event Detection and Forecasting wit...
 
Scalable Topic-Specific Influence Analysis on Microblogs
Scalable Topic-Specific Influence Analysis on MicroblogsScalable Topic-Specific Influence Analysis on Microblogs
Scalable Topic-Specific Influence Analysis on Microblogs
 
Fuzzy Rough Set Feature Selection to Enhance Phishing Attack Detection
Fuzzy Rough Set Feature Selection to Enhance Phishing Attack Detection Fuzzy Rough Set Feature Selection to Enhance Phishing Attack Detection
Fuzzy Rough Set Feature Selection to Enhance Phishing Attack Detection
 
A flexible recommenndation system for Cable TV
A flexible recommenndation system for Cable TVA flexible recommenndation system for Cable TV
A flexible recommenndation system for Cable TV
 
A Flexible Recommendation System for Cable TV
A Flexible Recommendation System for Cable TVA Flexible Recommendation System for Cable TV
A Flexible Recommendation System for Cable TV
 
Feature Based Opinion Mining from Amazon Reviews
Feature Based Opinion Mining from Amazon ReviewsFeature Based Opinion Mining from Amazon Reviews
Feature Based Opinion Mining from Amazon Reviews
 
Recommendation engine Using Genetic Algorithm
Recommendation engine Using Genetic AlgorithmRecommendation engine Using Genetic Algorithm
Recommendation engine Using Genetic Algorithm
 
Building Continuous Learning Systems
Building Continuous Learning SystemsBuilding Continuous Learning Systems
Building Continuous Learning Systems
 
DeepScan: Exploiting Deep Learning for Malicious Account Detection in Locatio...
DeepScan: Exploiting Deep Learning for Malicious Account Detection in Locatio...DeepScan: Exploiting Deep Learning for Malicious Account Detection in Locatio...
DeepScan: Exploiting Deep Learning for Malicious Account Detection in Locatio...
 
Recommendation Systems
Recommendation SystemsRecommendation Systems
Recommendation Systems
 
Who will RT this?: Automatically Identifying and Engaging Strangers on Twitte...
Who will RT this?: Automatically Identifying and Engaging Strangers on Twitte...Who will RT this?: Automatically Identifying and Engaging Strangers on Twitte...
Who will RT this?: Automatically Identifying and Engaging Strangers on Twitte...
 
Towards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology ApplicationsTowards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology Applications
 
A Two Step Ranking Solution for Twitter User Engagement
A Two Step Ranking Solution for Twitter User Engagement�A Two Step Ranking Solution for Twitter User Engagement�
A Two Step Ranking Solution for Twitter User Engagement
 
Real-time Classification of Malicious URLs on Twitter using Machine Activity ...
Real-time Classification of Malicious URLs on Twitter using Machine Activity ...Real-time Classification of Malicious URLs on Twitter using Machine Activity ...
Real-time Classification of Malicious URLs on Twitter using Machine Activity ...
 
Building High Available and Scalable Machine Learning Applications
Building High Available and Scalable Machine Learning ApplicationsBuilding High Available and Scalable Machine Learning Applications
Building High Available and Scalable Machine Learning Applications
 
Software Analytics: Data Analytics for Software Engineering
Software Analytics: Data Analytics for Software EngineeringSoftware Analytics: Data Analytics for Software Engineering
Software Analytics: Data Analytics for Software Engineering
 
AWS re:Invent 2016: Getting to Ground Truth with Amazon Mechanical Turk (MAC201)
AWS re:Invent 2016: Getting to Ground Truth with Amazon Mechanical Turk (MAC201)AWS re:Invent 2016: Getting to Ground Truth with Amazon Mechanical Turk (MAC201)
AWS re:Invent 2016: Getting to Ground Truth with Amazon Mechanical Turk (MAC201)
 
ML&AI APPROACH TO USER UNDERSTANDING ECOSYSTEM AT VCCORP Applications to News...
ML&AI APPROACH TO USER UNDERSTANDING ECOSYSTEM AT VCCORP Applications to News...ML&AI APPROACH TO USER UNDERSTANDING ECOSYSTEM AT VCCORP Applications to News...
ML&AI APPROACH TO USER UNDERSTANDING ECOSYSTEM AT VCCORP Applications to News...
 

More from azubiaga

Exploiting context for rumour detection in social media
Exploiting context for rumour detection in social mediaExploiting context for rumour detection in social media
Exploiting context for rumour detection in social mediaazubiaga
 
Crowdsourcing the Annotation of Rumourous Conversations in Social Media
Crowdsourcing the Annotation of Rumourous Conversations in Social MediaCrowdsourcing the Annotation of Rumourous Conversations in Social Media
Crowdsourcing the Annotation of Rumourous Conversations in Social Mediaazubiaga
 
Curating and Contextualizing Twitter Stories to Assist with Social Newsgathering
Curating and Contextualizing Twitter Stories to Assist with Social NewsgatheringCurating and Contextualizing Twitter Stories to Assist with Social Newsgathering
Curating and Contextualizing Twitter Stories to Assist with Social Newsgatheringazubiaga
 
Mining Twitter for Real-Time Trend and Information Discovery
Mining Twitter for Real-Time Trend and Information DiscoveryMining Twitter for Real-Time Trend and Information Discovery
Mining Twitter for Real-Time Trend and Information Discoveryazubiaga
 
Newspaper Editors vs the Crowd: On the Appropriateness of Front Page News Sel...
Newspaper Editors vs the Crowd: On the Appropriateness of Front Page News Sel...Newspaper Editors vs the Crowd: On the Appropriateness of Front Page News Sel...
Newspaper Editors vs the Crowd: On the Appropriateness of Front Page News Sel...azubiaga
 
Harnessing Folksonomies for Resource Classification
Harnessing Folksonomies for Resource ClassificationHarnessing Folksonomies for Resource Classification
Harnessing Folksonomies for Resource Classificationazubiaga
 
Clasificación de Páginas Web con Anotaciones Sociales
Clasificación de Páginas Web con Anotaciones SocialesClasificación de Páginas Web con Anotaciones Sociales
Clasificación de Páginas Web con Anotaciones Socialesazubiaga
 
Content-based Clustering for Tag Cloud Visualization
Content-based Clustering for Tag Cloud VisualizationContent-based Clustering for Tag Cloud Visualization
Content-based Clustering for Tag Cloud Visualizationazubiaga
 
Getting the Most Out of Social Annotations for Web Page Classification
Getting the Most Out of Social Annotations for Web Page ClassificationGetting the Most Out of Social Annotations for Web Page Classification
Getting the Most Out of Social Annotations for Web Page Classificationazubiaga
 
Enhancing Navigation on Wikipedia with Social Tags
Enhancing Navigation on Wikipedia with Social TagsEnhancing Navigation on Wikipedia with Social Tags
Enhancing Navigation on Wikipedia with Social Tagsazubiaga
 
Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?
Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?
Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?azubiaga
 
Etiketa-lainoen ikuskera hobetzeko multzokatzea
Etiketa-lainoen ikuskera hobetzeko multzokatzeaEtiketa-lainoen ikuskera hobetzeko multzokatzea
Etiketa-lainoen ikuskera hobetzeko multzokatzeaazubiaga
 
Master thesis presentation
Master thesis presentationMaster thesis presentation
Master thesis presentationazubiaga
 
Tags vs Shelves: From Social Tagging to Social Classification
Tags vs Shelves: From Social Tagging to Social ClassificationTags vs Shelves: From Social Tagging to Social Classification
Tags vs Shelves: From Social Tagging to Social Classificationazubiaga
 

More from azubiaga (14)

Exploiting context for rumour detection in social media
Exploiting context for rumour detection in social mediaExploiting context for rumour detection in social media
Exploiting context for rumour detection in social media
 
Crowdsourcing the Annotation of Rumourous Conversations in Social Media
Crowdsourcing the Annotation of Rumourous Conversations in Social MediaCrowdsourcing the Annotation of Rumourous Conversations in Social Media
Crowdsourcing the Annotation of Rumourous Conversations in Social Media
 
Curating and Contextualizing Twitter Stories to Assist with Social Newsgathering
Curating and Contextualizing Twitter Stories to Assist with Social NewsgatheringCurating and Contextualizing Twitter Stories to Assist with Social Newsgathering
Curating and Contextualizing Twitter Stories to Assist with Social Newsgathering
 
Mining Twitter for Real-Time Trend and Information Discovery
Mining Twitter for Real-Time Trend and Information DiscoveryMining Twitter for Real-Time Trend and Information Discovery
Mining Twitter for Real-Time Trend and Information Discovery
 
Newspaper Editors vs the Crowd: On the Appropriateness of Front Page News Sel...
Newspaper Editors vs the Crowd: On the Appropriateness of Front Page News Sel...Newspaper Editors vs the Crowd: On the Appropriateness of Front Page News Sel...
Newspaper Editors vs the Crowd: On the Appropriateness of Front Page News Sel...
 
Harnessing Folksonomies for Resource Classification
Harnessing Folksonomies for Resource ClassificationHarnessing Folksonomies for Resource Classification
Harnessing Folksonomies for Resource Classification
 
Clasificación de Páginas Web con Anotaciones Sociales
Clasificación de Páginas Web con Anotaciones SocialesClasificación de Páginas Web con Anotaciones Sociales
Clasificación de Páginas Web con Anotaciones Sociales
 
Content-based Clustering for Tag Cloud Visualization
Content-based Clustering for Tag Cloud VisualizationContent-based Clustering for Tag Cloud Visualization
Content-based Clustering for Tag Cloud Visualization
 
Getting the Most Out of Social Annotations for Web Page Classification
Getting the Most Out of Social Annotations for Web Page ClassificationGetting the Most Out of Social Annotations for Web Page Classification
Getting the Most Out of Social Annotations for Web Page Classification
 
Enhancing Navigation on Wikipedia with Social Tags
Enhancing Navigation on Wikipedia with Social TagsEnhancing Navigation on Wikipedia with Social Tags
Enhancing Navigation on Wikipedia with Social Tags
 
Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?
Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?
Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?
 
Etiketa-lainoen ikuskera hobetzeko multzokatzea
Etiketa-lainoen ikuskera hobetzeko multzokatzeaEtiketa-lainoen ikuskera hobetzeko multzokatzea
Etiketa-lainoen ikuskera hobetzeko multzokatzea
 
Master thesis presentation
Master thesis presentationMaster thesis presentation
Master thesis presentation
 
Tags vs Shelves: From Social Tagging to Social Classification
Tags vs Shelves: From Social Tagging to Social ClassificationTags vs Shelves: From Social Tagging to Social Classification
Tags vs Shelves: From Social Tagging to Social Classification
 

Recently uploaded

FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一F La
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一F sss
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhYasamin16
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGIThomas Poetter
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 

Recently uploaded (20)

FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 

Microposts2015 - Social Spam Detection on Twitter

  • 1. Making the Most of Tweet-Inherent Features for Social Spam Detection on Twitter Bo Wang, Arkaitz Zubiaga, Maria Liakata and Rob Procter Department of Computer Science University of Warwick 18th May 2015
  • 2. Social Spam on Twitter Motivation • Social spam is an important issue in social media services such as Twitter, e.g.: • Users inject tweets in trending topics. • Users reply with promotional messages providing a link. • We want to be able to identify these spam tweets in a Twitter stream.
  • 3. Social Spam on Twitter How Did we Feel the Need to Identify Spam? • We started tracking events via streaming API. • They were often riddled with noisy tweets.
  • 4. Social Spam on Twitter Example
  • 5. Social Spam on Twitter Our Approach • Detection of spammers: unsuitable, we couldn’t aggregate a user’s data from a stream. • Alternative solution: Determine if tweet is spam from its inherent features.
  • 6. Social Spam on Twitter Definitions • Spam originally coined for unsolicited email. • How to define spam for Twitter? (not easy!) • Twitter has own definition of spam, where certain level of advertisements is allowed: • It rather refers to the user level rather than tweet level, e.g., users who massively follow others. • Harder to define a spam than a spammer.
  • 7. Social Spam on Twitter Our Definition • Twitter spam: noisy content produced by users who express a different behaviour from what the system is intended for, and has the goal of grabbing attention by exploiting the social media service’s characteristics.
  • 8. Spammer vs. Spam Detection What Did Others Do? • Most previous work focused on spammer detection (users). • They used features which are not readily available in a tweet: • For example, historical user behaviour and network features. • Not feasible for our use.
  • 9. Spammer vs. Spam Detection What Do We Want To Do Instead? • (Near) Real-time spam detection, limited to features readily available in a stream of tweets. • Contributions: • Test on two existing datasets, adapted to our purposes. • Definition of different feature sets. • Compare different classification algorithms. • Investigate the use of different tweet-inherent features.
  • 10. Datasets • We relied on two (spammer vs non-spammer) datasets: • Social Honeypot (Lee et al., 2011 [1]): used social honeypots to attract spammers. • 1KS-10KN (Yang et al., 2011 [2]): harvested tweets containing certain malicious URLs. • Spammer dataset to our spam dataset: Randomly select one tweet from each spammer or legitimate user. • Social Honeypot: 20,707 spam vs 19,249 non-spam (∼1:1). • 1KS-10KN: 1,000 spam vs 9,828 non-spam (∼1:10).
  • 11. Feature Engineering User features Content features Length of profile name Number of words Length of profile description Number of characters Number of followings (FI) Number of white spaces Number of followers (FE) Number of capitalization words Number of tweets posted Number of capitalization words per word Age of the user account, in hours (AU) Maximum word length Ratio of number of followings and followers (FE/FI) Mean word length Reputation of the user (FE/(FI + FE)) Number of exclamation marks Following rate (FI/AU) Number of question marks Number of tweets posted per day Number of URL links Number of tweets posted per week Number of URL links per word N-grams Number of hashtags Uni + bi-gram or bi + tri-gram Number of hashtags per word Number of mentions Sentiment features Number of mentions per word Automatically created sentiment lexicons Number of spam words Manually created sentiment lexicons Number of spam words per word Part of speech tags of every tweet
  • 12. Evaluation Experiment Settings • 5 widely-used classification algorithms: Bernoulli Naive Bayes, KNN, SVM, Decision Tree and Random Forests. • Hyperparameters optimised from a subset of the dataset separate from train/test sets. • All 4 feature sets were combined. • 10-fold cross-validation.
  • 13. Evaluation Selection of Classifier Classifier 1KS-10KN Dataset Social Honeypot Dataset Precision Recall F-measure Precision Recall F1-measure Bernoulli NB 0.899 0.688 0.778 0.772 0.806 0.789 KNN 0.924 0.706 0.798 0.802 0.778 0.790 SVM 0.872 0.708 0.780 0.844 0.817 0.830 Decision Tree 0.788 0.782 0.784 0.914 0.916 0.915 Random Forest 0.993 0.716 0.831 0.941 0.950 0.946 • Random Forests outperform others in terms of F1-measure and Precision. • Better performance on Social Honeypot (1:1 ratio rather than 1:10?). • Results only 4% below original papers, which require historic user features.
  • 14. Evaluation Evaluation of Features (w/ Random Forests) Feature Set 1KS-10KN Dataset Social Honeypot Dataset Precision Recall F-measure Precision Recall F-measure User features (U) 0.895 0.709 0.791 0.938 0.940 0.940 Content features (C) 0.951 0.657 0.776 0.771 0.753 0.762 Uni + Bi-gram (Binary) 0.930 0.725 0.815 0.759 0.727 0.743 Uni + Bi-gram (Tf) 0.959 0.715 0.819 0.783 0.767 0.775 Uni + Bi-gram (Tfidf) 0.943 0.726 0.820 0.784 0.765 0.775 Bi + Tri-gram (Tfidf) 0.931 0.684 0.788 0.797 0.656 0.720 Sentiment features (S) 0.966 0.574 0.718 0.679 0.727 0.702 • Testing feature sets one by one: • User features (U) most determinant for Social Honeypot. • N-gram features best for 1KS-10KN. • Potentially due to diff. dataset generation approaches?
  • 15. Evaluation Evaluation of Features (w/ Random Forests) Feature Set 1KS-10KN Dataset Social Honeypot Dataset Precision Recall F-measure Precision Recall F-measure Single feature set 0.943 0.726 0.820 0.938 0.940 0.940 U + C 0.974 0.708 0.819 0.938 0.949 0.943 U + Bi & Tri-gram (Tf) 0.972 0.745 0.843 0.937 0.949 0.943 U + S 0.948 0.732 0.825 0.940 0.944 0.942 Uni & Bi-gram (Tf) + S 0.964 0.721 0.824 0.797 0.744 0.770 C + S 0.970 0.649 0.777 0.778 0.762 0.770 C + Uni & Bi-gram (Tf) 0.968 0.717 0.823 0.783 0.757 0.770 U + C + Uni & Bi-gram (Tf) 0.985 0.727 0.835 0.934 0.949 0.941 U + C + S 0.982 0.704 0.819 0.937 0.948 0.942 U + Uni & Bi-gram (Tf) + S 0.994 0.720 0.834 0.928 0.946 0.937 C + Uni & Bi-gram (Tf) + S 0.966 0.720 0.824 0.806 0.758 0.782 U + C + Uni & Bi-gram (Tf) + S 0.988 0.725 0.835 0.936 0.947 0.942 • However, when we combine feature sets: • The same approach performs best (F1) for both: U + Bi & Tri-gram (Tf). • Combining features helps us capture diff. types of spam tweets.
  • 16. Evaluation Computational Efficiency • Beyond accuracy, how can all these features be applied efficiently in a stream?
  • 17. Evaluation Computational Efficiency Feature set Comp. time (seconds) for 1k tweets User features 0.0057 N-gram 0.3965 Sentiment features 20.9838 Number of spam words (NSW) 19.0111 Part-of-speech counts (POS) 0.6139 Content features including NSW and POS 20.2367 Content features without NSW 1.0448 Content features without POS 19.6165 • Tested on regular computer (2.8 GHz Intel Core i7 processor and 16 GB memory). • The features that performed best in combination (User and N-grams) are those most efficiently calculated.
  • 18. Conclusion • Random Forests were found to be the most accurate classifier. • Comparable performance to previous work (-4%) while limiting features to those in a tweet. • The use of multiple feature sets increases the possibility to capture different spam types, and makes it more difficult for spammers to evade. • Diff. features perform better when used separately, but same features are useful when combined.
  • 19. Future Work • Spam corpus constructed by picking tweets from spammers. • Need to study if legitimate users also likely to post spam tweets, and how it could affect the results. • A more recent, manually labelled spam/non-spam dataset. • Feasibility of cross-dataset spam classification?
  • 20. That’s it! • Any Questions?
  • 21. K. Lee, B. D. Eoff, and J. Caverlee. Seven months with the devils: A long-term study of content polluters on twitter. In L. A. Adamic, R. A. Baeza-Yates, and S. Counts, editors, ICWSM. The AAAI Press, 2011. C. Yang, R. C. Harkreader, and G. Gu. Die free or live hard? empirical evaluation and new design for fighting evolving twitter spammers. In Proceedings of the 14th International Conference on Recent Advances in Intrusion Detection, RAID’11, pages 318–337, Berlin, Heidelberg, 2011. Springer-Verlag.