What is a false positive?

With spam, suddenly everyone cares about statistics. For the first time, system administrators are buying software that openly admits that it doesn't work all the time. Not only that, the percentages are pretty dismal. Would you buy a firewall that claims to work only 99% of the time? Or a compiler that advertises that it mis-compiles programs once every 1,000 times?

Of course, we know that many software packages are going to have errors and won't work 100% of the time. We just don't base our buying decision on that percentage. Virus scanners don't work 100% of the time, but you don't pick a virus scanner based on published results of how often it fails.


But that's the way we buy anti-spam products, and we will continue to do so for at least the next few years, with spam-catch rate and error rate as the all-important statistics in the buying process. At least that's what readers tell us. One thing we found in our test this year is that these products are not all alike. Several vendors called us to claim the opposite; they would prefer that people evaluate their products based on all the other features they've worked so hard to include. That's nice, but until anti-spam products work as well as anti-virus products - and they don't - we will keep testing for accuracy.

If numbers are going to be the single most important part of your buying decision, you should know what they mean. Since most of us forgot everything we knew about statistics a few hours after the final exam in college, we present this little primer as a refresher. Don't worry, there's no quiz at the end of the article.

The terms false positive and false negative (along with true positive and true negative) come to us from the world of diagnostic tests. An anti-spam product is like a pregnancy test - it eventually comes down to yes or no. False positive means the test said the message was spam, when in reality it wasn't. A false negative means that the test said a message was not spam, when in reality it was.
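
To make the four terms concrete, here is a small Python sketch of our own (not code from any product) that labels a single message based on what the filter said and what the message really was:

    # Illustrative sketch: the four possible outcomes of a spam test.
    def outcome(filter_says_spam, actually_spam):
        if filter_says_spam and actually_spam:
            return "true positive"    # spam, correctly flagged
        if filter_says_spam and not actually_spam:
            return "false positive"   # good mail wrongly flagged as spam
        if not filter_says_spam and actually_spam:
            return "false negative"   # spam that slipped through
        return "true negative"        # good mail, correctly delivered

    print(outcome(True, False))   # -> false positive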

We often think in terms of error rates, but with many diagnostic tests the kind of error is a big deal. It's not enough to know that the test is wrong 29% of the time. We want to know what kind of wrong. Spam tests are exactly like that. A false positive means that good mail might have gotten lost, while a false negative is just annoying. We care more about false positives than we do about false negatives (unless the CEO is getting inundated with false negatives). In addition to wanting to know how many errors there are, we also want to know what type they are.

You also may want to adjust the behavior of the system, so we gave points to products that let you change their behavior. Based on your tolerance for false negatives (spam in the mailbox) versus false positives (mail mis-marked as spam, then lost or delayed), you may want to set the product to use different thresholds. In our test, we didn't adjust those thresholds ourselves. Instead, we asked the vendors to pick thresholds and tune their products so that the false-positive rate would be kept to less than 1% of all e-mail.

Once you decide that you want to track false positives and false negatives separately, you need to stick to your guns. This is what researchers who study other diagnostic tests do, and it's what you need to do to make the best buying decision. Unfortunately, the path now gets more convoluted and confusing.

Four main statistics are used to describe diagnostic tests. The Positive Predictive Value (PPV) and Negative Predictive Value (NPV) go together. They measure how likely the test is to be correct. PPV, for example, measures the probability that a message actually is spam, given that the test says it is. PPV is computed by dividing the number of true positives by the sum of true positives and false positives. However, PPV doesn't say how much spam will be filtered out: the number of missed spam messages doesn't figure into that statistic at all.
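
For readers who prefer to see the formulas written out, here is a minimal Python sketch of PPV and NPV as defined above (the function names are ours; the counts you feed in would come from your own test):

    # PPV: of everything the filter called spam, how much really was spam?
    def ppv(true_pos, false_pos):
        return true_pos / (true_pos + false_pos)

    # NPV: of everything the filter passed, how much really was good mail?
    def npv(true_neg, false_neg):
        return true_neg / (true_neg + false_neg)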

Sensitivity and specificity are the other two statistics, sometimes called the True Positive Rate and True Negative Rate. They measure how likely a test is to catch whatever it is testing for. Sensitivity, for example, measures the probability that a message will test as spam, given that it actually is spam. Sensitivity is computed by dividing the number of true positives by the sum of true positives and false negatives. Most research on diagnostic tests uses PPV and NPV or sensitivity and specificity to describe how well a test works, because these are well-defined statistics.
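
The same style of sketch works for sensitivity and specificity (again, these are our illustrative functions, not code from any product):

    # Sensitivity: of all the spam that arrived, how much did the filter catch?
    def sensitivity(true_pos, false_neg):
        return true_pos / (true_pos + false_neg)

    # Specificity: of all the good mail that arrived, how much did the filter pass?
    def specificity(true_neg, false_pos):
        return true_neg / (true_neg + false_pos)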

We know that if we published statistics called PPV and specificity, people would get confused. So we tried to determine what would be most useful to readers, and we boiled it down to two main questions. First, "How much spam will this product filter out?" That question is answered by the sensitivity statistic. It tells us what percentage of the time spam will be identified by the filter. A perfect score would be 100%. In our test sample, there were 8,027 spam messages. Barracuda caught 7,563 of those and missed the rest. Setting aside the false positives (because that's a different question), Barracuda gave us a 94% reduction in spam: 94 out of every 100 spam messages were blocked.
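
Here is that catch-rate arithmetic spelled out, using the counts quoted above:

    # Barracuda's catch rate (sensitivity) from the numbers above.
    spam_total  = 8027   # spam messages in the test sample
    spam_caught = 7563   # spam messages the filter flagged

    catch_rate = spam_caught / spam_total
    print(f"{catch_rate:.0%}")   # -> 94%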

The second question is "How accurate is the filter?" Accuracy is best answered by the PPV statistic, which tells us what percentage of the time the filter flags mail correctly. Again, a perfect score would be 100%, meaning that when the filter says something is spam, it is right 100% of the time. Because people like to talk about false-positive rate, we've subtracted the PPV from 1 to calculate a false-positive rate. In our example, Barracuda was wrong 23 times, giving a PPV of 0.997, or a false-positive rate of 0.3%.
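
Again using the counts quoted above, the arithmetic looks like this:

    # Barracuda's accuracy (PPV) and our false-positive rate, from the numbers above.
    true_pos  = 7563   # spam correctly flagged
    false_pos = 23     # good mail wrongly flagged

    ppv = true_pos / (true_pos + false_pos)
    print(f"PPV: {ppv:.3f}")                      # -> PPV: 0.997
    print(f"False-positive rate: {1 - ppv:.1%}")  # -> False-positive rate: 0.3%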

Another way some researchers define false-positive rate is by subtracting the specificity from 1, or, equivalently, by dividing the number of false positives by the sum of false positives and true negatives - the two calculations give the same value. For most products, that number and the one we report are close, although they measure different things.
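
A quick sketch shows the equivalence (the counts here are made up purely for illustration):

    # 1 - specificity and FP / (FP + TN) are the same quantity.
    false_pos = 25      # hypothetical counts, not from our test
    true_neg  = 1975

    rate_a = 1 - true_neg / (true_neg + false_pos)
    rate_b = false_pos / (false_pos + true_neg)
    assert abs(rate_a - rate_b) < 1e-12
    print(f"{rate_b:.2%}")   # -> 1.25%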

Vendors have created yet another statistic by dividing the number of false positives by the total sample size. One reason for this is that it creates the smallest number they can report - by mixing false positives, true positives, false negatives and true negatives into one statistic, the denominator gets big, which means the result will be small. There is no statistical term for that number, although you will often see it in vendor literature as "false-positive rate." For example, if we calculated Barracuda's number this way, it would be 0.2%. The difference between 0.3% and 0.2% seems small, but 0.3% is 50% larger than 0.2% - that's a big difference when the numbers are that small. The other nice thing about the vendor-reported number, from the vendor's point of view, is that it sweeps under the rug the fact that a low false-positive rate is generally accompanied by a high false-negative rate.

An example helps here. Suppose 75% of your e-mail is spam: look at 100 messages, and 75 of them will be junk. Now compare two anti-spam filters. One looks at the 100 messages, says two of them are spam, and is wrong about one. The other looks at the same 100 messages, says 76 are spam, and is also wrong about one. If you simply divide the false-positive count by 100, both have a 1% false-positive rate. But if you use an honest false-positive rate, such as the one we use, you see real differences. The first product has a false-positive rate of 50% (1 true positive divided by 2 messages called spam is 0.5; subtract that from 1 and you get 50%), while the second, far more accurate product has a false-positive rate of about 1.3% (75 true positives divided by 76 messages called spam is roughly 0.987; subtract that from 1 and you get about 1.3%). If you calculated these numbers using the specificity statistic instead, both products would come out at 4% (1 false positive divided by the 25 messages that aren't spam), which again hides the difference between them.
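
If you want to run the comparison yourself, here is a short Python sketch of the three calculations for those two hypothetical filters:

    # Two hypothetical filters looking at 100 messages, 75 of them spam.
    filters = {
        "Filter A": {"called_spam": 2,  "true_pos": 1},
        "Filter B": {"called_spam": 76, "true_pos": 75},
    }

    good_mail = 25   # messages that are not spam

    for name, counts in filters.items():
        false_pos = counts["called_spam"] - counts["true_pos"]     # 1 for both filters
        true_neg  = good_mail - false_pos                          # 24 for both filters
        vendor    = false_pos / 100                                # FP / total sample
        honest    = 1 - counts["true_pos"] / counts["called_spam"] # 1 - PPV (our rate)
        spec_rate = false_pos / (false_pos + true_neg)             # 1 - specificity
        print(f"{name}: vendor {vendor:.0%}, ours {honest:.1%}, 1-specificity {spec_rate:.0%}")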

You can use whatever statistics you want to compare products, as long as you understand what each statistic is telling you and compute it identically across all products. When a vendor reports false-positive and false-negative rates, ask how those rates were computed. For most network managers, our statistics will give a strong feel for how good these products are at filtering out spam.

Copyright © 2004 IDG Communications, Inc.
