SlideShare a Scribd company logo
1 of 29
Download to read offline
Spam Filtering Techniques


   UMAR M. A. ALHARAKY, B.SC.

    RAKAN RAZOUK, PH.D.
Introduction

 What is spam?


    Spam also called as Unsolicited Commercial Email (UCE)

    Involves sending messages by email to numerous recipients at
     the same time (Mass Emailing).

    Grew exponentially since 1990 but has leveled off recently and
     is no longer growing exponentially

    80% of all spam is sent by less than 200 spammers
Introduction

 Purpose of Spam

    Advertisements

    Pyramid schemes (Multi-Level Marketing)

    Giveaways

    Chain letters

    Political email

    Stock market advice
Introduction

 Spam as a problem
      Consumes   computing resources and time
      Reduces the effectiveness of legitimate advertising
      Cost Shifting
      Fraud
      Identity Theft
      Consumer Perception
      Global Implications


 John Borgan [ReplyNet] – “Junk email is not just annoying anymore.
  It’s eating into productivity. It’s eating into time”
Some Statistics

       Cost of Spam 2009:
       $130 billion worldwide
       $42 billion in US alone
       30% increase from 2007 estimates
       100% increase in 2007 from 2005

 Main components of cost:
   Productivity loss from inspecting and deleting spam missed by spam
    control products (False Negatives)
   Productivity loss from searching for legitimate mails deleted in error
    by spam control products (False Positives)
       Operations and helpdesk costs (Filters and Firewalls – installment
        and maintenance)
Some Statistics
Some Statistics

   Spam Averages about 88% of all emails sent
       Barracuda Networks:
Introduction

 Federal Regulations (CAN-SPAM act of 2003)

    Controlling the Assault of Non-Solicited Pornography And
     Marketing (CAN-SPAM) Act of 2003 – signed on 12/16/2003



    Commercial email must adhere to three types of compliance
        Unsubscribe Compliance

      Content Compliance

      Sender   Behavior Compliance
Introduction

 Email Address Harvesting - Process of obtaining
 email addresses through various methods

    Purchasing /Trading lists with other spammers

    Bots

    Directory harvest attack

    Free Product or Services requiring valid email address

    News bulletins /Forums
Spam Life Cycle
Types of Spam Filters


   Header Filters
       Look at email headers to judge if forged or not
       Contain more information in addition to recipient , sender and subject
        fields




   Language Filters
       filters based on email body language
       Can be used to filter out spam written in foreign languages
Types of Spam Filters


   Content Filters
     Scan the text content of emails
     Use fuzzy logic


   Permission Filters
       Based on Challenge /Response system

   White list/blacklist Filters
     Will only accept emails from list of “good email addresses”
     Will block emails from “bad email addresses”
Types of Spam Filters


   Community Filters
     Work on  the principal of "communal knowledge" of spam
     These types of filters communicate with a central server.




   Bayesian Filters
     Statistical email filtering
     Uses Naïve Bayes classifier
Spam Filters Properties

   Filter must prevent spam from entering inboxes

   Able to detect the spam without blocking the ham
     Maximize    efficiency of the filter

   Do not require any modification to existing e-mail protocols

   Easily incremental
       Spam evolve continuously

       Need to adapt to each user
Data Mining and Spam Filtering

What is Data Mining?
 Predictive analysis of data for the discovery of trends and patterns
 Reveals critical insights into unstructured data.


Why Data Mining?
 The amount of data is increasing every year.
 To convert the acquired data into information and knowledge and use it in
  decision making.

How it is done?
 Utilizes computing power to apply sophisticated algorithms for knowledge
  discovery of data.

The Target…
 Spam Filtering using Data Mining
Data Mining and Spam Filtering


 Spam Filtering can be seen as a specific text
 categorization (Classification)



 History
       Jason Rennie’s iFile (1996); first know program to use Bayesian
        Classification for spam filtering
Bayesian Classification

 Particular words have particular probabilities of
 occurring in spam email and in legitimate email
  For instance, most email users will frequently encounter the word
   "Viagra" in spam email, but will seldom see it in other email

 The filter doesn't know these probabilities in
 advance, and must first be trained so it can build
 them up

 To train the filter, the user must manually indicate
 whether a new email is spam or not
Bayesian Classification

 For all words in each training email, the filter will
 adjust the probabilities that each word will appear in
 spam or legitimate email in its database
  For instance, Bayesian spam filters will typically have learned a very
   high spam probability for the words "Viagra" and "refinance", but a
   very low spam probability for words seen only in legitimate email,
   such as the names of friends and family members

 After training, the word probabilities are used to
 compute the probability that an email with a
 particular set of words in it belongs to either
 category
Bayesian Classification

 Each word in the email contributes to the email's
 spam probability, or only the most interesting words

 This contribution is computed using Bayes'
 theorem

 Then, the email's spam probability is computed
 over all words in the email, and if the total exceeds
 a certain threshold (say 95%), the filter will mark
 the email as a spam.
Bayesian Classification

 The initial training can usually be refined when
  wrong judgments from the software are identified
  (false positives or false negatives)
  That allows the software to dynamically adapt to the ever evolving
   nature of spam

 Some spam filters combine the results of both
  Bayesian spam filtering and other heuristics (pre-
  defined rules about the contents, looking at the message's
  envelope, etc.)
  Resulting   in even higher filtering accuracy
Computing the Probability


1. Computing the probability that a message
  containing a given word is spam:

 Let's suppose the suspected message contains the
  word "replica“
     Most people who are used to receiving e-mail know that this
      message is likely to be spam

 The formula used by the software for computing
Computing the Probability




where:
           is the probability that a message is a spam, knowing that the
    word "replica" is in it;
       is the overall probability that any given message is spam;
           is the probability that the word "replica" appears in spam
    messages;
        is the overall probability that any given message is not spam (is
    "ham");
            is the probability that the word "replica" appears in ham
    messages.
Computing the Probability


2. Spam or ham:
 Recent statistics show that current probability of
 any message to be spam is 80%, at the very least:

 However, most spam detection software are “not
 biased”, meaning that they have no prejudice
 regarding the incoming email


     It is advisable that the datasets of spam and ham are of same size
Computing the Probability


2. Spam or ham:



 This quantity is called "spamicity" (or "spaminess")
 of the word "replica“
     Pr(W|S) is approximated to the frequency of messages containing
      "replica" in the messages identified as spam during the learning
      phase
     Pr(W|H) is approximated to the frequency of messages containing
      "replica" in the messages identified as ham during the learning
      phase
Computing the Probability


3. Combining individual probabilities:
 The Bayesian spam filtering software makes the "naïve"
  assumption that the words present in the message
  are independent events

 With that assumption,



where:
 p is the probability that the suspect message is spam
 pi is the probability p(S|Wi)
Pros & Cons

Advantages
 It can be trained on a per-user basis
    The spam that a user receives is often related to the
     online user's activities
Disadvantages
 Bayesian spam filtering is susceptible
  to Bayesian poisoning
    Insertion of random innocuous words that are not
     normally associated with spam
    Replace text with pictures
    Google, by its Gmail email system, performing an
     OCR to every mid to large size image, analyzing the
     text inside
Pros & Cons
Conclusions


 Spam emails not only consume computing resources,
 but can also be frustrating

 Numerous detection techniques exist, but none is a
 “good for all scenarios” technique

 Data Mining approaches for content based spam
 filtering seem promising
Thank You / Questions

More Related Content

What's hot

miniproject.ppt.pptx
miniproject.ppt.pptxminiproject.ppt.pptx
miniproject.ppt.pptxAnush90
 
Spam detection using machine learning based binary classifier_043660
Spam detection using machine learning based binary classifier_043660Spam detection using machine learning based binary classifier_043660
Spam detection using machine learning based binary classifier_043660syaidatulamirah
 
An Approach for Malicious Spam Detection in Email with Comparison of Differen...
An Approach for Malicious Spam Detection in Email with Comparison of Differen...An Approach for Malicious Spam Detection in Email with Comparison of Differen...
An Approach for Malicious Spam Detection in Email with Comparison of Differen...IRJET Journal
 
Spam filtering with Naive Bayes Algorithm
Spam filtering with Naive Bayes AlgorithmSpam filtering with Naive Bayes Algorithm
Spam filtering with Naive Bayes AlgorithmAkshay Pal
 
A Survey: SMS Spam Filtering
A Survey: SMS Spam FilteringA Survey: SMS Spam Filtering
A Survey: SMS Spam Filteringijtsrd
 
Spam email detection using machine learning PPT.pptx
Spam email detection using machine learning PPT.pptxSpam email detection using machine learning PPT.pptx
Spam email detection using machine learning PPT.pptxKunal Kalamkar
 
Diabetes prediction using machine learning
Diabetes prediction using machine learningDiabetes prediction using machine learning
Diabetes prediction using machine learningdataalcott
 
E Mail & Spam Presentation
E Mail & Spam PresentationE Mail & Spam Presentation
E Mail & Spam Presentationnewsan2001
 
SMS Spam Filter Design Using R: A Machine Learning Approach
SMS Spam Filter Design Using R: A Machine Learning ApproachSMS Spam Filter Design Using R: A Machine Learning Approach
SMS Spam Filter Design Using R: A Machine Learning ApproachReza Rahimi
 
K mean-clustering algorithm
K mean-clustering algorithmK mean-clustering algorithm
K mean-clustering algorithmparry prabhu
 
3.2 partitioning methods
3.2 partitioning methods3.2 partitioning methods
3.2 partitioning methodsKrish_ver2
 
Presentation2.pptx
Presentation2.pptxPresentation2.pptx
Presentation2.pptxWanderer20
 
Seminar On Naive Bayes for Spam Filtering
Seminar On Naive Bayes for Spam Filtering Seminar On Naive Bayes for Spam Filtering
Seminar On Naive Bayes for Spam Filtering Asrarulhaq Maktedar
 
Machine Learning Project
Machine Learning ProjectMachine Learning Project
Machine Learning ProjectAbhishek Singh
 
Supervised and unsupervised learning
Supervised and unsupervised learningSupervised and unsupervised learning
Supervised and unsupervised learningParas Kohli
 
Machine Learning project presentation
Machine Learning project presentationMachine Learning project presentation
Machine Learning project presentationRamandeep Kaur Bagri
 

What's hot (20)

miniproject.ppt.pptx
miniproject.ppt.pptxminiproject.ppt.pptx
miniproject.ppt.pptx
 
Spam detection using machine learning based binary classifier_043660
Spam detection using machine learning based binary classifier_043660Spam detection using machine learning based binary classifier_043660
Spam detection using machine learning based binary classifier_043660
 
An Approach for Malicious Spam Detection in Email with Comparison of Differen...
An Approach for Malicious Spam Detection in Email with Comparison of Differen...An Approach for Malicious Spam Detection in Email with Comparison of Differen...
An Approach for Malicious Spam Detection in Email with Comparison of Differen...
 
Spam filtering with Naive Bayes Algorithm
Spam filtering with Naive Bayes AlgorithmSpam filtering with Naive Bayes Algorithm
Spam filtering with Naive Bayes Algorithm
 
A Survey: SMS Spam Filtering
A Survey: SMS Spam FilteringA Survey: SMS Spam Filtering
A Survey: SMS Spam Filtering
 
Spam email detection using machine learning PPT.pptx
Spam email detection using machine learning PPT.pptxSpam email detection using machine learning PPT.pptx
Spam email detection using machine learning PPT.pptx
 
Diabetes prediction using machine learning
Diabetes prediction using machine learningDiabetes prediction using machine learning
Diabetes prediction using machine learning
 
Web crawler
Web crawlerWeb crawler
Web crawler
 
E Mail & Spam Presentation
E Mail & Spam PresentationE Mail & Spam Presentation
E Mail & Spam Presentation
 
SMS Spam Filter Design Using R: A Machine Learning Approach
SMS Spam Filter Design Using R: A Machine Learning ApproachSMS Spam Filter Design Using R: A Machine Learning Approach
SMS Spam Filter Design Using R: A Machine Learning Approach
 
Final Report(SuddhasatwaSatpathy)
Final Report(SuddhasatwaSatpathy)Final Report(SuddhasatwaSatpathy)
Final Report(SuddhasatwaSatpathy)
 
FAKE NEWS DETECTION PPT
FAKE NEWS DETECTION PPT FAKE NEWS DETECTION PPT
FAKE NEWS DETECTION PPT
 
K mean-clustering algorithm
K mean-clustering algorithmK mean-clustering algorithm
K mean-clustering algorithm
 
3.2 partitioning methods
3.2 partitioning methods3.2 partitioning methods
3.2 partitioning methods
 
Presentation2.pptx
Presentation2.pptxPresentation2.pptx
Presentation2.pptx
 
Seminar On Naive Bayes for Spam Filtering
Seminar On Naive Bayes for Spam Filtering Seminar On Naive Bayes for Spam Filtering
Seminar On Naive Bayes for Spam Filtering
 
Machine Learning Project
Machine Learning ProjectMachine Learning Project
Machine Learning Project
 
Supervised and unsupervised learning
Supervised and unsupervised learningSupervised and unsupervised learning
Supervised and unsupervised learning
 
Machine Learning project presentation
Machine Learning project presentationMachine Learning project presentation
Machine Learning project presentation
 
spam_msg_detection.pdf
spam_msg_detection.pdfspam_msg_detection.pdf
spam_msg_detection.pdf
 

Similar to Spam Filtering

Survey on spam filtering
Survey on spam filteringSurvey on spam filtering
Survey on spam filteringChippy Thomas
 
How to Keep Spam Off Your Network
How to Keep Spam Off Your NetworkHow to Keep Spam Off Your Network
How to Keep Spam Off Your NetworkGFI Software
 
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...Editor IJCATR
 
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...Editor IJCATR
 
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...Editor IJCATR
 
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...Editor IJCATR
 
NetworkPaperthesis1
NetworkPaperthesis1NetworkPaperthesis1
NetworkPaperthesis1Dhara Shah
 
AN ANALYSIS OF EFFECTIVE ANTI SPAM PROTOCOL USING DECISION TREE CLASSIFIERS
AN ANALYSIS OF EFFECTIVE ANTI SPAM PROTOCOL USING DECISION TREE CLASSIFIERSAN ANALYSIS OF EFFECTIVE ANTI SPAM PROTOCOL USING DECISION TREE CLASSIFIERS
AN ANALYSIS OF EFFECTIVE ANTI SPAM PROTOCOL USING DECISION TREE CLASSIFIERSijsrd.com
 
Network paperthesis1
Network paperthesis1Network paperthesis1
Network paperthesis1Dhara Shah
 
Seminar on web mail filter
Seminar   on   web mail filterSeminar   on   web mail filter
Seminar on web mail filterShalini Gs
 
Spam filtering.pptx
Spam filtering.pptxSpam filtering.pptx
Spam filtering.pptxAkfeteAssefa
 
A Model for Fuzzy Logic Based Machine Learning Approach for Spam Filtering
A Model for Fuzzy Logic Based Machine Learning Approach for  Spam FilteringA Model for Fuzzy Logic Based Machine Learning Approach for  Spam Filtering
A Model for Fuzzy Logic Based Machine Learning Approach for Spam FilteringIOSR Journals
 
2010 Spam Filtered World Fv
2010 Spam Filtered World Fv2010 Spam Filtered World Fv
2010 Spam Filtered World Fvcactussky
 
A review of spam filtering and measures of antispam
A review of spam filtering and measures of antispamA review of spam filtering and measures of antispam
A review of spam filtering and measures of antispamAlexander Decker
 

Similar to Spam Filtering (20)

DEVELOPMENT OF AN EFFECTIVE BAYESIAN APPROACH FOR SPAM FILTERING
DEVELOPMENT OF AN EFFECTIVE BAYESIAN APPROACH FOR SPAM FILTERINGDEVELOPMENT OF AN EFFECTIVE BAYESIAN APPROACH FOR SPAM FILTERING
DEVELOPMENT OF AN EFFECTIVE BAYESIAN APPROACH FOR SPAM FILTERING
 
SPAM FILTERS
SPAM FILTERSSPAM FILTERS
SPAM FILTERS
 
Survey on spam filtering
Survey on spam filteringSurvey on spam filtering
Survey on spam filtering
 
402 406
402 406402 406
402 406
 
How to Keep Spam Off Your Network
How to Keep Spam Off Your NetworkHow to Keep Spam Off Your Network
How to Keep Spam Off Your Network
 
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
 
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
 
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
 
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
 
NetworkPaperthesis1
NetworkPaperthesis1NetworkPaperthesis1
NetworkPaperthesis1
 
Email deliverability
Email deliverabilityEmail deliverability
Email deliverability
 
AN ANALYSIS OF EFFECTIVE ANTI SPAM PROTOCOL USING DECISION TREE CLASSIFIERS
AN ANALYSIS OF EFFECTIVE ANTI SPAM PROTOCOL USING DECISION TREE CLASSIFIERSAN ANALYSIS OF EFFECTIVE ANTI SPAM PROTOCOL USING DECISION TREE CLASSIFIERS
AN ANALYSIS OF EFFECTIVE ANTI SPAM PROTOCOL USING DECISION TREE CLASSIFIERS
 
Network paperthesis1
Network paperthesis1Network paperthesis1
Network paperthesis1
 
Seminar on web mail filter
Seminar   on   web mail filterSeminar   on   web mail filter
Seminar on web mail filter
 
Spam filtering.pptx
Spam filtering.pptxSpam filtering.pptx
Spam filtering.pptx
 
A Model for Fuzzy Logic Based Machine Learning Approach for Spam Filtering
A Model for Fuzzy Logic Based Machine Learning Approach for  Spam FilteringA Model for Fuzzy Logic Based Machine Learning Approach for  Spam Filtering
A Model for Fuzzy Logic Based Machine Learning Approach for Spam Filtering
 
What is SPAM?
What is SPAM?What is SPAM?
What is SPAM?
 
2010 Spam Filtered World Fv
2010 Spam Filtered World Fv2010 Spam Filtered World Fv
2010 Spam Filtered World Fv
 
A review of spam filtering and measures of antispam
A review of spam filtering and measures of antispamA review of spam filtering and measures of antispam
A review of spam filtering and measures of antispam
 
B0940509
B0940509B0940509
B0940509
 

More from Umar Alharaky

Function Point Counting Practices
Function Point Counting PracticesFunction Point Counting Practices
Function Point Counting PracticesUmar Alharaky
 
CMMI for Development
CMMI for DevelopmentCMMI for Development
CMMI for DevelopmentUmar Alharaky
 
Generalized Stochastic Petri Nets
Generalized Stochastic Petri NetsGeneralized Stochastic Petri Nets
Generalized Stochastic Petri NetsUmar Alharaky
 
Simulation Tracking Object Reference Model (STORM)
Simulation Tracking Object Reference Model (STORM)Simulation Tracking Object Reference Model (STORM)
Simulation Tracking Object Reference Model (STORM)Umar Alharaky
 

More from Umar Alharaky (6)

Function Point Counting Practices
Function Point Counting PracticesFunction Point Counting Practices
Function Point Counting Practices
 
CMMI for Development
CMMI for DevelopmentCMMI for Development
CMMI for Development
 
Generalized Stochastic Petri Nets
Generalized Stochastic Petri NetsGeneralized Stochastic Petri Nets
Generalized Stochastic Petri Nets
 
Data integration
Data integrationData integration
Data integration
 
Simulation Tracking Object Reference Model (STORM)
Simulation Tracking Object Reference Model (STORM)Simulation Tracking Object Reference Model (STORM)
Simulation Tracking Object Reference Model (STORM)
 
Turing machine
Turing machineTuring machine
Turing machine
 

Recently uploaded

Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 

Recently uploaded (20)

DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 

Spam Filtering

  • 1. Spam Filtering Techniques UMAR M. A. ALHARAKY, B.SC. RAKAN RAZOUK, PH.D.
  • 2. Introduction  What is spam?  Spam also called as Unsolicited Commercial Email (UCE)  Involves sending messages by email to numerous recipients at the same time (Mass Emailing).  Grew exponentially since 1990 but has leveled off recently and is no longer growing exponentially  80% of all spam is sent by less than 200 spammers
  • 3. Introduction  Purpose of Spam  Advertisements  Pyramid schemes (Multi-Level Marketing)  Giveaways  Chain letters  Political email  Stock market advice
  • 4. Introduction  Spam as a problem  Consumes computing resources and time  Reduces the effectiveness of legitimate advertising  Cost Shifting  Fraud  Identity Theft  Consumer Perception  Global Implications  John Borgan [ReplyNet] – “Junk email is not just annoying anymore. It’s eating into productivity. It’s eating into time”
  • 5. Some Statistics  Cost of Spam 2009:  $130 billion worldwide  $42 billion in US alone  30% increase from 2007 estimates  100% increase in 2007 from 2005  Main components of cost:  Productivity loss from inspecting and deleting spam missed by spam control products (False Negatives)  Productivity loss from searching for legitimate mails deleted in error by spam control products (False Positives)  Operations and helpdesk costs (Filters and Firewalls – installment and maintenance)
  • 7. Some Statistics  Spam Averages about 88% of all emails sent  Barracuda Networks:
  • 8. Introduction  Federal Regulations (CAN-SPAM act of 2003)  Controlling the Assault of Non-Solicited Pornography And Marketing (CAN-SPAM) Act of 2003 – signed on 12/16/2003  Commercial email must adhere to three types of compliance  Unsubscribe Compliance  Content Compliance  Sender Behavior Compliance
  • 9. Introduction  Email Address Harvesting - Process of obtaining email addresses through various methods  Purchasing /Trading lists with other spammers  Bots  Directory harvest attack  Free Product or Services requiring valid email address  News bulletins /Forums
  • 11. Types of Spam Filters  Header Filters  Look at email headers to judge if forged or not  Contain more information in addition to recipient , sender and subject fields  Language Filters  filters based on email body language  Can be used to filter out spam written in foreign languages
  • 12. Types of Spam Filters  Content Filters  Scan the text content of emails  Use fuzzy logic  Permission Filters  Based on Challenge /Response system  White list/blacklist Filters  Will only accept emails from list of “good email addresses”  Will block emails from “bad email addresses”
  • 13. Types of Spam Filters  Community Filters  Work on the principal of "communal knowledge" of spam  These types of filters communicate with a central server.  Bayesian Filters  Statistical email filtering  Uses Naïve Bayes classifier
  • 14. Spam Filters Properties  Filter must prevent spam from entering inboxes  Able to detect the spam without blocking the ham  Maximize efficiency of the filter  Do not require any modification to existing e-mail protocols  Easily incremental  Spam evolve continuously  Need to adapt to each user
  • 15. Data Mining and Spam Filtering What is Data Mining?  Predictive analysis of data for the discovery of trends and patterns  Reveals critical insights into unstructured data. Why Data Mining?  The amount of data is increasing every year.  To convert the acquired data into information and knowledge and use it in decision making. How it is done?  Utilizes computing power to apply sophisticated algorithms for knowledge discovery of data. The Target…  Spam Filtering using Data Mining
  • 16. Data Mining and Spam Filtering  Spam Filtering can be seen as a specific text categorization (Classification)  History  Jason Rennie’s iFile (1996); first know program to use Bayesian Classification for spam filtering
  • 17. Bayesian Classification  Particular words have particular probabilities of occurring in spam email and in legitimate email  For instance, most email users will frequently encounter the word "Viagra" in spam email, but will seldom see it in other email  The filter doesn't know these probabilities in advance, and must first be trained so it can build them up  To train the filter, the user must manually indicate whether a new email is spam or not
  • 18. Bayesian Classification  For all words in each training email, the filter will adjust the probabilities that each word will appear in spam or legitimate email in its database  For instance, Bayesian spam filters will typically have learned a very high spam probability for the words "Viagra" and "refinance", but a very low spam probability for words seen only in legitimate email, such as the names of friends and family members  After training, the word probabilities are used to compute the probability that an email with a particular set of words in it belongs to either category
  • 19. Bayesian Classification  Each word in the email contributes to the email's spam probability, or only the most interesting words  This contribution is computed using Bayes' theorem  Then, the email's spam probability is computed over all words in the email, and if the total exceeds a certain threshold (say 95%), the filter will mark the email as a spam.
  • 20. Bayesian Classification  The initial training can usually be refined when wrong judgments from the software are identified (false positives or false negatives)  That allows the software to dynamically adapt to the ever evolving nature of spam  Some spam filters combine the results of both Bayesian spam filtering and other heuristics (pre- defined rules about the contents, looking at the message's envelope, etc.)  Resulting in even higher filtering accuracy
  • 21. Computing the Probability 1. Computing the probability that a message containing a given word is spam:  Let's suppose the suspected message contains the word "replica“  Most people who are used to receiving e-mail know that this message is likely to be spam  The formula used by the software for computing
  • 22. Computing the Probability where:  is the probability that a message is a spam, knowing that the word "replica" is in it;  is the overall probability that any given message is spam;  is the probability that the word "replica" appears in spam messages;  is the overall probability that any given message is not spam (is "ham");  is the probability that the word "replica" appears in ham messages.
  • 23. Computing the Probability 2. Spam or ham:  Recent statistics show that current probability of any message to be spam is 80%, at the very least:  However, most spam detection software are “not biased”, meaning that they have no prejudice regarding the incoming email  It is advisable that the datasets of spam and ham are of same size
  • 24. Computing the Probability 2. Spam or ham:  This quantity is called "spamicity" (or "spaminess") of the word "replica“  Pr(W|S) is approximated to the frequency of messages containing "replica" in the messages identified as spam during the learning phase  Pr(W|H) is approximated to the frequency of messages containing "replica" in the messages identified as ham during the learning phase
  • 25. Computing the Probability 3. Combining individual probabilities:  The Bayesian spam filtering software makes the "naïve" assumption that the words present in the message are independent events  With that assumption, where:  p is the probability that the suspect message is spam  pi is the probability p(S|Wi)
  • 26. Pros & Cons Advantages  It can be trained on a per-user basis  The spam that a user receives is often related to the online user's activities Disadvantages  Bayesian spam filtering is susceptible to Bayesian poisoning  Insertion of random innocuous words that are not normally associated with spam  Replace text with pictures  Google, by its Gmail email system, performing an OCR to every mid to large size image, analyzing the text inside
  • 28. Conclusions  Spam emails not only consume computing resources, but can also be frustrating  Numerous detection techniques exist, but none is a “good for all scenarios” technique  Data Mining approaches for content based spam filtering seem promising
  • 29. Thank You / Questions