Bayesian Filter

The Bayesian Filter is a component used by the VPOP3 spam filter.

Bayesian filters are widely used in spam filtering. For instance, the spam filters built into many email clients use Bayesian filtering.

The VPOP3 spam filter tests each message with the Bayesian filter, and the resulting rating affects the spam score through the Bayes50, Bayes80, Bayes90 and Bayes99 rules, depending on whether the Bayesian filter rates the message as at least 50%, 80%, 90% or 99% likely to be spam.
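As a rough illustration, the sketch below lists the rules whose threshold a given Bayesian rating meets. The rule names and percentages come from the paragraph above; the function name matching_bayes_rules is hypothetical, and this page does not say whether VPOP3 applies every matching rule or only the highest one, so the sketch simply returns all of them.

```python
# Thresholds taken from the rule names above; everything else is illustrative.
BAYES_RULES = [
    ("Bayes50", 0.50),
    ("Bayes80", 0.80),
    ("Bayes90", 0.90),
    ("Bayes99", 0.99),
]


def matching_bayes_rules(spam_probability):
    """Return the names of the rules whose threshold the rating meets."""
    return [name for name, threshold in BAYES_RULES
            if spam_probability >= threshold]


print(matching_bayes_rules(0.93))
```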

A full account of how a Bayesian filter works is quite complex, and there are good articles about it on the Internet. The one which made Bayesian filtering popular for spam filtering is 'A Plan for Spam' by Paul Graham, and Wikipedia also has a good article on the subject. The method is sometimes called a naïve Bayesian filter. The word 'naïve' is not a criticism of the method; it means that the approach is purely statistical and treats each word independently of the others, without any attempt at being 'clever'.

Although a complete description is complex, the basic mechanism is explained below.

First, the Bayesian filter is initialised with a set of email messages which are already known to be good or bad. It tracks how many times each word (or 'term') appears in good and in bad messages, as well as the total number of good and bad messages. This data can be used to determine how 'spammy' a given word is likely to be.
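The bookkeeping described above could be sketched as follows. This is a minimal illustration, not VPOP3's actual storage: the class name BayesData, the simple word-splitting pattern, and the choice to count each word once per message are all assumptions.

```python
import re
from collections import Counter


class BayesData:
    """Hypothetical store for per-word and per-message counts."""

    def __init__(self):
        self.spam_words = Counter()   # word -> number of spam messages containing it
        self.good_words = Counter()   # word -> number of good messages containing it
        self.spam_messages = 0
        self.good_messages = 0

    def learn(self, text, is_spam):
        # Assumption: each distinct word is counted once per message, so the
        # counts can be read as "messages containing this word".
        words = set(re.findall(r"[a-z0-9']+", text.lower()))
        if is_spam:
            self.spam_words.update(words)
            self.spam_messages += 1
        else:
            self.good_words.update(words)
            self.good_messages += 1


data = BayesData()
data.learn("Buy a replica watch, click here", is_spam=True)
data.learn("Please can you send the report", is_spam=False)
print(data.spam_words["replica"], data.spam_messages)
```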

For instance, our set of Bayesian data shows that the word 'replica' is found in 2.8% of spam, but only 0.1% of non-spam. The word can therefore be calculated to be 'very spammy', because it is 28 times more likely to be found in spam than in non-spam.

There are less obvious examples. For instance, a quick look through our Bayesian data shows that the word 'click' is found in about 30% of spam and 15% of non-spam, so it is twice as likely to be found in spam. On the other hand, 'please' is in 34% of non-spam and 17% of spam, so it is twice as likely to be found in legitimate mail. The word 'can' is in 30% of non-spam and 19% of spam, so it is roughly 50% more likely to be in legitimate mail, and so on.
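The figures quoted above can be turned into a single 'spamminess' value per word. The sketch below uses one common way of doing this, which assumes that spam and non-spam are equally common; the function name spamminess is hypothetical and VPOP3's exact calculation may differ.

```python
def spamminess(spam_fraction, good_fraction):
    """Estimate how likely a message containing the word is to be spam.

    Assumption: spam and non-spam messages are equally common overall.
    """
    return spam_fraction / (spam_fraction + good_fraction)


# (fraction of spam containing the word, fraction of non-spam containing it)
examples = {
    "replica": (0.028, 0.001),
    "click": (0.30, 0.15),
    "please": (0.17, 0.34),
    "can": (0.19, 0.30),
}

for word, (in_spam, in_good) in examples.items():
    ratio = in_spam / in_good
    print(f"{word}: spam/non-spam ratio {ratio:.2f}, "
          f"spamminess {spamminess(in_spam, in_good):.2f}")
```

A value above 0.5 means the word leans towards spam, and a value below 0.5 means it leans towards legitimate mail.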

When a message is received:

1. VPOP3 breaks the message down into words, as it does during training, and then looks up how 'spammy' each word is.

2. The VPOP3 Bayesian filter ignores any word which it has not seen at least 5 times before (if it used rare words sooner, the results could be misleading).

3. Then, the Bayesian filter calculates how 'interesting' each word is. This is determined by how far the word is from '50% spammy'. For instance, a word which is found only in spam or only in non-spam is very interesting.

4. The Bayesian filter then throws away all words except for the 15 most 'interesting' words.

5. The filter then uses a formula to create an overall 'spamminess' value for the message based on the spamminess of these 15 most interesting words.
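The steps above can be sketched as follows. The limits of 5 sightings and 15 words come from the list; everything else is an assumption made for illustration. In particular, this page does not give the formula used in step 5, so the sketch uses the combining formula from Paul Graham's article, clamps each value to the range 0.01 to 0.99, and returns 0.5 when no usable words are found. The names message_spamminess and word_stats are hypothetical.

```python
import math

MIN_SEEN = 5          # step 2: ignore words seen fewer than 5 times
MAX_INTERESTING = 15  # step 4: keep only the 15 most interesting words


def message_spamminess(words, word_stats):
    """Combine per-word spamminess into one value for the message.

    words: the words found in the message.
    word_stats: hypothetical dict of word -> (times_seen, spamminess 0..1).
    """
    # Steps 1 and 2: look up each distinct word, skipping rare or unknown ones.
    known = []
    for word in set(words):
        seen, p = word_stats.get(word, (0, 0.5))
        if seen >= MIN_SEEN:
            known.append(p)

    # Steps 3 and 4: 'interesting' means far from 0.5; keep the top 15.
    top = sorted(known, key=lambda p: abs(p - 0.5), reverse=True)
    top = top[:MAX_INTERESTING]
    if not top:
        return 0.5  # assumption: no evidence either way

    # Step 5 (assumed formula): clamp so that a single word cannot force
    # the result to exactly 0 or 1, then combine the probabilities.
    top = [min(max(p, 0.01), 0.99) for p in top]
    spam = math.prod(top)
    good = math.prod(1 - p for p in top)
    return spam / (spam + good)


stats = {
    "replica": (40, 0.97),
    "click": (900, 0.67),
    "please": (1200, 0.33),
    "zxqv": (2, 0.99),  # seen too rarely, so it is ignored
}
score = message_spamminess("please click for a replica zxqv".split(), stats)
print(round(score, 2))
```

Note how 'click' and 'please' cancel each other out in this example, so the result is driven almost entirely by 'replica'.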

Because a Bayesian filter considers all the words (or 'terms') in a message, and does not make its decision in the way that a human would, its results can sometimes seem confusing or counter-intuitive. VPOP3 has a page where you can enter an email message and see the details of the above steps, to help you understand how the filter reached its result.

As well as words in the message content, the VPOP3 Bayesian filter also processes message header data. It remembers each header word as '<header field>:<word>'. For instance, the word 'viagra' in the message subject would be remembered as 'Subject:viagra'. This allows VPOP3 to also check for sender addresses or even mail servers which are often used for sending spam.
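A minimal sketch of this '<header field>:<word>' naming, using Python's standard email parser, is shown below. The function name header_terms, the way header values are split into words, and the lower-casing of words are assumptions; VPOP3's own tokenising rules are not described on this page.

```python
import re
from email import message_from_string


def header_terms(raw_message):
    """Return '<header field>:<word>' terms for every header in the message."""
    msg = message_from_string(raw_message)
    terms = []
    for field, value in msg.items():
        # Assumption: split on whitespace and common address punctuation.
        for word in re.findall(r'[^\s<>,;"()]+', str(value)):
            terms.append(f"{field}:{word.lower()}")
    return terms


raw = (
    "From: Sales <offers@example.com>\n"
    "Subject: Cheap Viagra\n"
    "\n"
    "Body text"
)
print(header_terms(raw))
```

Because the header field is part of the term, 'Subject:viagra' is counted separately from 'viagra' in the message body.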

Training the Bayesian Filter

Bayesian filters need training with both good and bad messages so they can learn the probabilities of words being in either.

The VPOP3 Bayesian filter constantly trains itself using messages you manually mark as spam or not-spam. Also, if the VPOP3 spam filter detects that a message is spam, it will train the Bayesian filter that it is spam, and if it doesn't detect it as spam, or the message is sent by a local user, it will train the Bayesian filter that it is not-spam. This can lead to some reinforcement bias, but it makes the filter simpler to use. Otherwise there would have to be a strict regime of manually sorting spam from not-spam and training the filter with significant amounts of each. To turn off this self-training, set the UpdateBayes value to '0' in the script configuration settings.

You can also manually train the filter. If you send a spam message which the filter missed to spam@<your local domain>, the spam filter will unlearn that it was good, and learn that it was bad. When you release a message from the spam filter quarantine, or send it to notspam@<your local domain>, the spam filter will unlearn that it was bad, and learn that it was good.
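The 'unlearn, then learn' correction described above could look something like the sketch below. The class TrainableBayes and its methods are hypothetical and only illustrate the idea of moving a message's counts from one side to the other.

```python
from collections import Counter


class TrainableBayes:
    """Hypothetical word counts which can be corrected after the fact."""

    def __init__(self):
        self.words = {"spam": Counter(), "good": Counter()}
        self.messages = {"spam": 0, "good": 0}

    def learn(self, words, label):
        self.words[label].update(set(words))
        self.messages[label] += 1

    def unlearn(self, words, label):
        self.words[label].subtract(set(words))
        self.words[label] += Counter()  # drops zero and negative counts
        self.messages[label] = max(0, self.messages[label] - 1)

    def reclassify(self, words, old_label, new_label):
        # Undo the earlier training before recording the corrected label.
        self.unlearn(words, old_label)
        self.learn(words, new_label)


bayes = TrainableBayes()
message = ["cheap", "replica", "watches"]
bayes.learn(message, "good")               # the filter missed it and self-trained
bayes.reclassify(message, "good", "spam")  # the message is reported as spam
print(bayes.messages)
```

Unlearning first matters because otherwise the same message would count as evidence on both sides.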

Also, when users send messages through VPOP3 (either to local or external recipients), VPOP3 learns that those messages are good (on the assumption that your users don't send out spam).
