Math to fight spam

The Rev. Thomas Bayes (1702-1761) is best known for "An Essay towards solving a Problem in the Doctrine of Chances," a paper published posthumously in the Philosophical Transactions of the Royal Society of London in 1763.

Lest you think that you've walked into History 101, let us assure you that we are merely keeping our word. Last week we promised to elucidate Bayesian filtering, a technique used for getting rid of spam, and Bayes was the discoverer of Bayes' Theorem, upon which Bayesian filtering is based.

Bayes' Theorem is a way of calculating the probability of an event in light of evidence, using what previous trials tell us about how often the event and the evidence occur. The theorem states that for events X and Y, the probability of X given that Y has happened (denoted by p{ X | Y } ) equals the probability of Y given that X has happened ( p{ Y | X } ) times the probability of X happening ( p{ X } ) divided by the probability of Y happening ( p{ Y } ). To put that another way:

p{ X | Y } = ( p{ X } * p{ Y | X } ) / p{ Y }

Or, more generally, when X1 through Xn are mutually exclusive and cover every possibility,

p{ Xi | Y } = ( p{ Xi } * p{ Y | Xi } ) / ( ( p{ X1 } * p{ Y | X1 } ) + ... + ( p{ Xi } * p{ Y | Xi } ) + ... + ( p{ Xn } * p{ Y | Xn } ) )
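For readers who would rather see that as code, here is a minimal Python sketch of the general form. The function name `posterior` and the sample numbers are ours, purely for illustration; the sketch assumes the Xi are mutually exclusive and that their probabilities sum to 1.

```python
def posterior(priors, likelihoods):
    """Return p{ Xi | Y } for every i.

    priors      -- the values p{ X1 } ... p{ Xn }
    likelihoods -- the values p{ Y | X1 } ... p{ Y | Xn }
    """
    if len(priors) != len(likelihoods):
        raise ValueError("priors and likelihoods must have the same length")

    # Each numerator: p{ Xi } * p{ Y | Xi }
    joint = [p * l for p, l in zip(priors, likelihoods)]

    # The denominator is the sum of all the numerators, which is p{ Y }
    p_y = sum(joint)
    if p_y == 0:
        raise ValueError("Y has zero probability under these inputs")

    return [j / p_y for j in joint]


# Hypothetical numbers: two equally likely causes, the second of which
# produces Y three times as often as the first.
print(posterior([0.5, 0.5], [0.2, 0.6]))  # roughly [0.25, 0.75]
```

Note that the denominator is just the sum of every possible numerator, which is why the results always add up to 1.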

Clear? No? OK, let's apply this to the IT world. Let's say we maintain a software package with three configuration options - option A is used by 40% of our users, option B by 30% and option C by 30% (users can only use one option at a time).

If we assume that each option raises the same percentage of support requests (say, 1% of the number of users of that option), then we would obviously want to focus our effort on improving software quality according to which option has the greatest number of users. That makes option A our first focus, after which we could start to polish either B or C.
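A quick way to check that reasoning, as a Python sketch: when every option has the same problem rate, the rate cancels out of Bayes' Theorem and each option's share of the problems equals its share of the users. The variable names here are ours, and the 1% rate is the guess from the paragraph above.

```python
# Share of users on each option, p{ option }, from the text
shares = {"A": 0.40, "B": 0.30, "C": 0.30}

# Assumed p{ problem | option }: the same 1% guess for every option
rate = 0.01

# p{ problem }: total probability that a user has a problem
p_problem = sum(share * rate for share in shares.values())

# p{ option | problem } for each option
blame = {opt: share * rate / p_problem for opt, share in shares.items()}

for opt, p in blame.items():
    print(f"p{{ {opt} | problem }} = {p:.0%}")
```

With equal rates, the output simply mirrors the 40/30/30 split of users.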

But the percentages of support requests are a guess at this point. As we accumulate experience supporting this product we find out that 0.5% of A users have problems, 0.75% of B users and 0.95% of C users. Now where should we apply our efforts?

Let's find out what the probability of a problem being caused by Option A (denoted by p{ A | problem }) actually is.

According to Bayes:

p{ A | problem } = ( p{ A } * p{ problem | A } ) / ( ( p{ A } * p{ problem | A } ) + ( p{ B } * p{ problem | B } ) + ( p{ C } * p{ problem | C } ) )

Here, p{ A } equals 40%; p{ B } equals 30%; and p{ C } equals 30%. From our support experience we know that p{ problem | A } equals 0.5%, while p{ problem | B } equals 0.75% and p{ problem | C } equals 0.95%, so we get:

p{ A | problem } = ( 0.4 * 0.005 ) / ( ( 0.4 * 0.005 ) + ( 0.3 * 0.0075 ) + ( 0.3 * 0.0095 ) ) = 0.2817 = 28%

Doing the same calculation for the other options we find that p{ B | problem } equals 32% and p{ C | problem } equals 40%.
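If you'd like to reproduce those three figures, the following Python sketch does the arithmetic using the user shares and problem rates given above. The function name `blame_by_option` is ours and purely illustrative.

```python
# p{ option }: share of users on each option, from the text
shares = {"A": 0.40, "B": 0.30, "C": 0.30}

# p{ problem | option }: observed problem rate for each option, from the text
problem_rates = {"A": 0.005, "B": 0.0075, "C": 0.0095}


def blame_by_option(shares, problem_rates):
    """Return p{ option | problem } for each option via Bayes' Theorem."""
    # Numerators: p{ option } * p{ problem | option }
    joint = {opt: shares[opt] * problem_rates[opt] for opt in shares}
    # Denominator: p{ problem }, the sum of the numerators
    p_problem = sum(joint.values())
    return {opt: j / p_problem for opt, j in joint.items()}


observed = blame_by_option(shares, problem_rates)
for opt, p in observed.items():
    print(f"p{{ {opt} | problem }} = {p:.4f}")
```

Rounded to whole percentages, the results match the 28%, 32% and 40% quoted above.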

Now we know that we should put our efforts into improving Option C. Cool, eh? And let's say we fix Option C so that p{ problem | C } equals 0.05%. This now means p{ C | problem } equals 3% while p{ A | problem } becomes 45% and p{ B | problem } becomes 51%.
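The before-and-after comparison can be sketched in Python as follows. The helper name `summarize` is ours; the shares and rates are the ones from the text, with only Option C's rate changed to 0.05%.

```python
# p{ option }: share of users on each option, from the text
shares = {"A": 0.40, "B": 0.30, "C": 0.30}

# p{ problem | option } before and after the fix to Option C
before = {"A": 0.005, "B": 0.0075, "C": 0.0095}
after = dict(before, C=0.0005)  # only C changes, to 0.05%


def summarize(rates):
    """Return (p{ problem }, {option: p{ option | problem }})."""
    joint = {opt: shares[opt] * rates[opt] for opt in shares}
    p_problem = sum(joint.values())
    return p_problem, {opt: j / p_problem for opt, j in joint.items()}


for label, rates in (("before", before), ("after", after)):
    p_problem, blame = summarize(rates)
    parts = ", ".join(f"{opt}: {p:.0%}" for opt, p in blame.items())
    print(f"{label}: p{{ problem }} = {p_problem:.2%}; {parts}")
```

One thing the sketch makes visible: A and B take a larger share of the problems after the fix even though nothing about them changed, because the overall problem rate, p{ problem }, fell from 0.71% to 0.44% of users.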

Bayes' Theorem is fascinating, powerful, and highly useful when you are dealing with multiple interdependent events. And for that reason, it has become a foundation of spam-filtering systems.

Next week, we'll discuss exactly how Bayes got involved in fighting spam.
