Network World

Spam and statistics

By Joel Snyder, Network World, 09/15/2003

Say false positive, and you immediately dive into a tough world: the statistics of diagnostic tests. The terms false positive and false negative (and their cousins, true positive and true negative) are fairly easy to define. But turning counts of false positives and false negatives into easy-to-digest statistics is harder, because the anti-spam community has not agreed on which numbers to use across products.

A spam filter is a diagnostic test. For some set of thresholds, it will say "this is spam" or "this is not spam." In our testing, we didn't expose those thresholds. Instead, we asked the vendors to pick thresholds that would keep the false-positive rate below 1%. Interestingly enough, none of the vendors asked what we meant by false-positive rate. Depending on your tolerance for false negatives (spam in your mailbox) or false positives (mail mismarked as spam, lost or delayed), you might want to set these thresholds differently.
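The trade-off between the two kinds of error can be sketched in a few lines of Python. This is a minimal illustration, assuming a filter that assigns each message a single spam score and flags anything at or above one threshold; real products may combine several thresholds. The scores, labels and the `count_errors` helper are invented for the example and are not data from this test.

```python
# Hypothetical (score, is_spam) pairs; a higher score means "more spam-like".
# These values are made up for illustration only.
messages = [
    (0.95, True),
    (0.80, True),
    (0.65, False),
    (0.60, True),
    (0.30, False),
    (0.10, False),
]


def count_errors(messages, threshold):
    """Return (false_positives, false_negatives) for a given threshold."""
    # False positive: legitimate mail flagged as spam.
    fp = sum(1 for score, is_spam in messages if score >= threshold and not is_spam)
    # False negative: spam that slips through to the mailbox.
    fn = sum(1 for score, is_spam in messages if score < threshold and is_spam)
    return fp, fn


for threshold in (0.5, 0.7, 0.9):
    fp, fn = count_errors(messages, threshold)
    print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")
```

With these made-up scores, raising the threshold removes the one false positive but lets more spam through, which is the choice each site has to make for itself.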

Four main statistics are used to describe diagnostic tests. Positive predictive value (PPV) and negative predictive value (NPV) go together. They measure how likely the test is to be correct. PPV measures the probability that a message actually is spam, given that the test says that it is. PPV is computed by dividing the number of true positives by the sum of true positives and false positives. NPV is the mirror image: the probability that a message actually is not spam, given that the test says it is not, computed by dividing the number of true negatives by the sum of true negatives and false negatives. However, PPV doesn't say how much spam will be filtered out: The number of missed spam messages doesn't figure into that statistic at all.
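As a rough sketch, the two predictive values can be computed directly from the four counts. The function names and the sample counts below are assumptions chosen for illustration; they are not figures from this test.

```python
def ppv(tp: int, fp: int) -> float:
    """Positive predictive value: P(actually spam | test says spam)."""
    return tp / (tp + fp)


def npv(tn: int, fn: int) -> float:
    """Negative predictive value: P(actually not spam | test says not spam)."""
    return tn / (tn + fn)


# Hypothetical counts for 2,000 messages (1,000 spam, 1,000 legitimate).
tp, fp, tn, fn = 900, 10, 990, 100

print(f"PPV = {ppv(tp, fp):.4f}")
print(f"NPV = {npv(tn, fn):.4f}")

# PPV ignores missed spam: changing the false-negative count leaves it unchanged.
assert ppv(tp, fp) == ppv(900, 10)
```

Note that `ppv` never receives the false-negative count, which is why a filter can post a high PPV while still letting a lot of spam through.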

Sensitivity and specificity are the other two statistics, sometimes called the true positive rate and true negative rate. They measure how likely a test is to catch whatever is being tested. Sensitivity, for example, measures the probability that a message will test as spam, given that it actually is spam. Sensitivity is computed by dividing the number of true positives by the sum of true positives and false negatives. Specificity is the counterpart for legitimate mail: the probability that a message will test as not spam, given that it actually is not spam, computed by dividing the number of true negatives by the sum of true negatives and false positives. Most research on diagnostic tests uses PPV and NPV or sensitivity and specificity to describe how well a test works because these are well-defined statistics.
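The same kind of sketch works for sensitivity and specificity. Again, the function names and the counts are hypothetical and serve only to show the arithmetic.

```python
def sensitivity(tp: int, fn: int) -> float:
    """True positive rate: P(test says spam | actually spam)."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """True negative rate: P(test says not spam | actually not spam)."""
    return tn / (tn + fp)


# Hypothetical counts for 2,000 messages (1,000 spam, 1,000 legitimate).
tp, fp, tn, fn = 900, 10, 990, 100

print(f"sensitivity = {sensitivity(tp, fn):.2%}")  # 90.00%
print(f"specificity = {specificity(tn, fp):.2%}")  # 99.00%
```

In this made-up case the filter catches 90% of spam and correctly passes 99% of legitimate mail.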

The term false-positive rate is, unfortunately, not commonly defined or agreed on. For some people, the false-positive rate is the proportion of cases testing positive that are actually not spam. That is, it's the complement of the PPV. For others, the false-positive rate is the proportion of the total sample (i.e., all mail messages) that is not spam but tests positive as spam. That is, it's the complement of the relative specificity. Rather than pick an ambiguous definition, we focused on things that made sense in the world of spam and didn't overlap each other in definition.
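To see why the definition matters, the sketch below computes three quantities that could each plausibly be called a false-positive rate: the complement of PPV, the complement of specificity, and false positives as a share of all mail. The function name, dictionary keys and counts are assumptions for illustration, not figures from this test.

```python
def false_positive_rates(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Return three candidate 'false-positive rates' from the same counts."""
    total = tp + fp + tn + fn
    return {
        # Share of flagged mail that is legitimate: 1 - PPV.
        "of_flagged": fp / (tp + fp),
        # Share of legitimate mail that is flagged: 1 - specificity.
        "of_legitimate": fp / (fp + tn),
        # Share of all mail that is legitimate but flagged.
        "of_all_mail": fp / total,
    }


# Hypothetical counts for 2,000 messages (1,000 spam, 1,000 legitimate).
rates = false_positive_rates(tp=900, fp=10, tn=990, fn=100)

for name, value in rates.items():
    print(f"{name}: {value:.2%}")
# of_flagged: 1.10%
# of_legitimate: 1.00%
# of_all_mail: 0.50%
```

With identical counts, the same filter misses, just misses or comfortably meets a "less than 1%" target depending on which denominator is used.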
