Skip Links

Network World

  • Social Web 
  • Email 
  • Close

(Comma separation for multiple addresses)
Your Message:

A Bayesian filter that nabs 99.75% of spam?

By Mark Gibbs , Network World , 09/29/2003
Gibbs
  • Share/Email
  • Tweet This
  • Comment
  • Print

We suspect that you had more than enough mathematics in the form of Bayes Theorem last week so this week we'll explain how it's used in what is called Bayesian filtering to remove spam (note that the same techniques also can be used for general text filtering).

One of the best and most accessible descriptions of Bayesian filtering is an article from August 2002 titled "A Plan for Spam" by Paul Graham.

The background to this work was an attempt to build a spam-proof Web-based mail reader written in a version of Lisp called Arc that Graham and his colleagues were developing (we have no idea what happened to this project).

Upfront Graham says that using the filter "we now miss less than 5 per 1,000 spams, with 0 false positives." Sounds good, and the amazing thing about the filter is that it really isn't that complicated. Actually, if you read Graham's first and second papers you'll see he added features to the second version that improved performance considerably.

To create the filter, sample sets of spam and non-spam mail were analyzed by breaking the text into tokens. Tokens are created from the text of a message using a few simple rules: Case is preserved; exclamation marks are retained; periods and commas are retained if they occur between two digits (to allow IP addresses and prices to be included); price ranges are recognized (such as $20 to $25) and turned into two separate tokens (for example, $20 and $25); and tokens that occur within header fields such as To:, From: and Subject:, or within URLs are prefaced with an identifier to let them be weighted separately from a body token (for example, "Enlargement" in the Subject: becomes "Subject*Enlargement" - the asterisk can be any character that is ignored in messages).

The frequency of all tokens in each set is counted, and Bayes Theorem is applied to derive another table of the probabilities associated with each token being used in a spam or a non-spam message. It is worth noting that Graham weights the non-spam tokens twice as heavily as spam tokens to minimize false positives - the extra weight actually makes the filter less sensitive to spam but safer for non-spam (on the other hand, the success Graham claims makes the decreased sensitivity irrelevant).

  • Share/Email
  • Tweet This
  • Comment
  • Print
Comment
Login
Forgot your account info?
Add comment
Anonymous comments subject to approval. Register here for member benefits.
Have a NetworkWorld account? Log in here. Register now for a free account.

Videos

rssRss Feed