- 4chan hell raisers finding fame brings heat?
- The 10 dumbest mistakes network managers make
- NetApp quits bidding war in face of EMC opposition
- CompuServe closes after 30 years
- Google to launch open-source Chrome OS this year
We suspect that you had more than enough mathematics in the form of Bayes Theorem last week so this week we'll explain how it's used in what is called Bayesian filtering to remove spam (note that the same techniques also can be used for general text filtering).
One of the best and most accessible descriptions of Bayesian filtering is an article from August 2002 titled "A Plan for Spam" by Paul Graham.
The background to this work was an attempt to build a spam-proof Web-based mail reader written in a version of Lisp called Arc that Graham and his colleagues were developing (we have no idea what happened to this project).
Upfront Graham says that using the filter "we now miss less than 5 per 1,000 spams, with 0 false positives." Sounds good, and the amazing thing about the filter is that it really isn't that complicated. Actually, if you read Graham's first and second papers you'll see he added features to the second version that improved performance considerably.
To create the filter, sample sets of spam and non-spam mail were analyzed by breaking the text into tokens. Tokens are created from the text of a message using a few simple rules: Case is preserved; exclamation marks are retained; periods and commas are retained if they occur between two digits (to allow IP addresses and prices to be included); price ranges are recognized (such as $20 to $25) and turned into two separate tokens (for example, $20 and $25); and tokens that occur within header fields such as To:, From: and Subject:, or within URLs are prefaced with an identifier to let them be weighted separately from a body token (for example, "Enlargement" in the Subject: becomes "Subject*Enlargement" - the asterisk can be any character that is ignored in messages).
The frequency of all tokens in each set is counted, and Bayes Theorem is applied to derive another table of the probabilities associated with each token being used in a spam or a non-spam message. It is worth noting that Graham weights the non-spam tokens twice as heavily as spam tokens to minimize false positives - the extra weight actually makes the filter less sensitive to spam but safer for non-spam (on the other hand, the success Graham claims makes the decreased sensitivity irrelevant).
Comment