2. Introduction
What is spam?
Spam also called as Unsolicited Commercial Email (UCE)
Involves sending messages by email to numerous recipients at
the same time (Mass Emailing).
Grew exponentially since 1990 but has leveled off recently and
is no longer growing exponentially
80% of all spam is sent by less than 200 spammers
4. Introduction
Spam as a problem
Consumes computing resources and time
Reduces the effectiveness of legitimate advertising
Cost Shifting
Fraud
Identity Theft
Consumer Perception
Global Implications
John Borgan [ReplyNet] – “Junk email is not just annoying anymore.
It’s eating into productivity. It’s eating into time”
5. Some Statistics
Cost of Spam 2009:
$130 billion worldwide
$42 billion in US alone
30% increase from 2007 estimates
100% increase in 2007 from 2005
Main components of cost:
Productivity loss from inspecting and deleting spam missed by spam
control products (False Negatives)
Productivity loss from searching for legitimate mails deleted in error
by spam control products (False Positives)
Operations and helpdesk costs (Filters and Firewalls – installment
and maintenance)
7. Some Statistics
Spam Averages about 88% of all emails sent
Barracuda Networks:
8. Introduction
Federal Regulations (CAN-SPAM act of 2003)
Controlling the Assault of Non-Solicited Pornography And
Marketing (CAN-SPAM) Act of 2003 – signed on 12/16/2003
Commercial email must adhere to three types of compliance
Unsubscribe Compliance
Content Compliance
Sender Behavior Compliance
9. Introduction
Email Address Harvesting - Process of obtaining
email addresses through various methods
Purchasing /Trading lists with other spammers
Bots
Directory harvest attack
Free Product or Services requiring valid email address
News bulletins /Forums
11. Types of Spam Filters
Header Filters
Look at email headers to judge if forged or not
Contain more information in addition to recipient , sender and subject
fields
Language Filters
filters based on email body language
Can be used to filter out spam written in foreign languages
12. Types of Spam Filters
Content Filters
Scan the text content of emails
Use fuzzy logic
Permission Filters
Based on Challenge /Response system
White list/blacklist Filters
Will only accept emails from list of “good email addresses”
Will block emails from “bad email addresses”
13. Types of Spam Filters
Community Filters
Work on the principal of "communal knowledge" of spam
These types of filters communicate with a central server.
Bayesian Filters
Statistical email filtering
Uses Naïve Bayes classifier
14. Spam Filters Properties
Filter must prevent spam from entering inboxes
Able to detect the spam without blocking the ham
Maximize efficiency of the filter
Do not require any modification to existing e-mail protocols
Easily incremental
Spam evolve continuously
Need to adapt to each user
15. Data Mining and Spam Filtering
What is Data Mining?
Predictive analysis of data for the discovery of trends and patterns
Reveals critical insights into unstructured data.
Why Data Mining?
The amount of data is increasing every year.
To convert the acquired data into information and knowledge and use it in
decision making.
How it is done?
Utilizes computing power to apply sophisticated algorithms for knowledge
discovery of data.
The Target…
Spam Filtering using Data Mining
16. Data Mining and Spam Filtering
Spam Filtering can be seen as a specific text
categorization (Classification)
History
Jason Rennie’s iFile (1996); first know program to use Bayesian
Classification for spam filtering
17. Bayesian Classification
Particular words have particular probabilities of
occurring in spam email and in legitimate email
For instance, most email users will frequently encounter the word
"Viagra" in spam email, but will seldom see it in other email
The filter doesn't know these probabilities in
advance, and must first be trained so it can build
them up
To train the filter, the user must manually indicate
whether a new email is spam or not
18. Bayesian Classification
For all words in each training email, the filter will
adjust the probabilities that each word will appear in
spam or legitimate email in its database
For instance, Bayesian spam filters will typically have learned a very
high spam probability for the words "Viagra" and "refinance", but a
very low spam probability for words seen only in legitimate email,
such as the names of friends and family members
After training, the word probabilities are used to
compute the probability that an email with a
particular set of words in it belongs to either
category
19. Bayesian Classification
Each word in the email contributes to the email's
spam probability, or only the most interesting words
This contribution is computed using Bayes'
theorem
Then, the email's spam probability is computed
over all words in the email, and if the total exceeds
a certain threshold (say 95%), the filter will mark
the email as a spam.
20. Bayesian Classification
The initial training can usually be refined when
wrong judgments from the software are identified
(false positives or false negatives)
That allows the software to dynamically adapt to the ever evolving
nature of spam
Some spam filters combine the results of both
Bayesian spam filtering and other heuristics (pre-
defined rules about the contents, looking at the message's
envelope, etc.)
Resulting in even higher filtering accuracy
21. Computing the Probability
1. Computing the probability that a message
containing a given word is spam:
Let's suppose the suspected message contains the
word "replica“
Most people who are used to receiving e-mail know that this
message is likely to be spam
The formula used by the software for computing
22. Computing the Probability
where:
is the probability that a message is a spam, knowing that the
word "replica" is in it;
is the overall probability that any given message is spam;
is the probability that the word "replica" appears in spam
messages;
is the overall probability that any given message is not spam (is
"ham");
is the probability that the word "replica" appears in ham
messages.
23. Computing the Probability
2. Spam or ham:
Recent statistics show that current probability of
any message to be spam is 80%, at the very least:
However, most spam detection software are “not
biased”, meaning that they have no prejudice
regarding the incoming email
It is advisable that the datasets of spam and ham are of same size
24. Computing the Probability
2. Spam or ham:
This quantity is called "spamicity" (or "spaminess")
of the word "replica“
Pr(W|S) is approximated to the frequency of messages containing
"replica" in the messages identified as spam during the learning
phase
Pr(W|H) is approximated to the frequency of messages containing
"replica" in the messages identified as ham during the learning
phase
25. Computing the Probability
3. Combining individual probabilities:
The Bayesian spam filtering software makes the "naïve"
assumption that the words present in the message
are independent events
With that assumption,
where:
p is the probability that the suspect message is spam
pi is the probability p(S|Wi)
26. Pros & Cons
Advantages
It can be trained on a per-user basis
The spam that a user receives is often related to the
online user's activities
Disadvantages
Bayesian spam filtering is susceptible
to Bayesian poisoning
Insertion of random innocuous words that are not
normally associated with spam
Replace text with pictures
Google, by its Gmail email system, performing an
OCR to every mid to large size image, analyzing the
text inside
28. Conclusions
Spam emails not only consume computing resources,
but can also be frustrating
Numerous detection techniques exist, but none is a
“good for all scenarios” technique
Data Mining approaches for content based spam
filtering seem promising