
Antispam: how we did it

Feb 24, 2003

We put these products through their paces by taking Opus One’s mail stream and replicating it in real time at the gateway. One copy of the stream went to our normal mail server, and the other two copies were sent in real time to the MailFrontier and Cloudmark gateways. All testing was done using our Cubix Density system with 533-MHz Intel CPUs and 512M bytes of memory. MailFrontier then fed the mail to a Microsoft Exchange 2000 Server, while Cloudmark’s relay fed on to a POP/IMAP mail server. This setup ensured that each product saw the mail as it came in, without any appreciable delay.

Approximately 3,000 messages passed through the system during the seven days of the test. We configured the Cloudmark gateway to tag all messages so that we could see each message’s score. MailFrontier also tagged all messages using its three-level scale.

MailFrontier’s technique for reducing false positives is to preload its system with the contact list and addresses learned from sent mail. We took one month’s worth of sent mail and let the MailFrontier user profiler use that pile of mail to build an initial whitelist. MailFrontier also asked that we use Microsoft Outlook with its user profiler and Microsoft Exchange with its corporate profiler to continue to refine the whitelist. That would have been difficult in this test scenario. Instead, we noted any false positives that might have been prevented by the week’s worth of profile information. There were none to worry about in the short testing period we used. MailFrontier also has some fine-tuning knobs for the sensitivity of its server, but the company told us that it advises most customers to leave them alone, so that’s what we did.

We examined every message, classifying it ourselves as “spam”, meaning unsolicited commercial e-mail; “not spam”, which included mailing lists in which we participate and normal correspondence; or “can’t tell”, for mail that might not have been unsolicited but looked and smelled like spam anyway. Because we couldn’t tell, we figured that neither Cloudmark nor MailFrontier could either, so we dropped those dozen messages out of the picture.

Then we logged the Cloudmark tag (a number from 0 to 100) and the MailFrontier tag (either “Junk”, “Maybe Junk” or nothing) with each message.

We then tried to project how most companies would be using these products. Because the goal is to filter the messages before they get to end users, we hypothesized that a company would take messages that it is pretty sure are spam and quarantine them at the gateway. Messages that it’s not so sure about would be sent on but marked, so that the user could put them into a separate folder for later review and keep them out of the day-to-day mail flow. Messages that were definitely not spam would be sent on directly. Because filtering messages is the only sensible way to use these products, false positives represent real business e-mail that got “lost” along the way – casualties of the war on spam.

For MailFrontier, those three categories were easy to map to the product because MailFrontier only has three ways to mark a message: spam, maybe spam, or not spam. For Cloudmark, it was much more difficult. Cloudmark’s 100-point scale gives a lot of possibilities. After consultation with the Cloudmark team, we put out three sets of numbers for Cloudmark: 50/80, 70/95 and 80/98, essentially simulating running the same test three times. The 50/80 tag means that any message with a score less than 50 was not spam, between 50 and 80 was maybe spam and above 80 was spam. Likewise, 70/95 means that a score less than 70 was not spam, between 70 and 95 was maybe spam and above 95 was spam.
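The threshold pairs described above can be sketched as a small three-way classifier. This is only an illustration, not the script used in the test: the function names are hypothetical, and treating scores that land exactly on a boundary (for example, exactly 50 or exactly 80) as "maybe spam" is an assumption, since the description only says "less than", "between" and "above".

```python
# Hypothetical sketch of mapping a Cloudmark score (0 to 100) onto the
# three categories, using a threshold pair such as 50/80.
# Assumption: scores exactly on either boundary count as "maybe spam".

THRESHOLD_SETS = [(50, 80), (70, 95), (80, 98)]  # pairs named in the article


def classify(score, maybe_at, spam_above):
    """Return 'not spam', 'maybe spam' or 'spam' for one score."""
    if score < maybe_at:
        return "not spam"
    if score <= spam_above:
        return "maybe spam"
    return "spam"


def classify_all(score):
    """Classify one score under every threshold pair."""
    return {
        f"{maybe_at}/{spam_above}": classify(score, maybe_at, spam_above)
        for maybe_at, spam_above in THRESHOLD_SETS
    }


if __name__ == "__main__":
    for label, verdict in classify_all(96).items():
        print(label, verdict)
```

Running one logged score through all three pairs is what makes it possible to simulate three test runs from a single week of mail.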

Using some quick-and-dirty programming, we then generated two critical statistics for each product: false positive rate and overall spam reduction rate.

False positives are messages that are not spam, but were marked as spam. We calculated the false positive rate as the number of non-spam messages marked as spam divided by the total number of messages received, or 3,090. For example, MailFrontier identified 32 messages as spam that were not, so its false positive rate is 1.0% (32 divided by 3,090). False positives are particularly bad with these two products because a message that is filtered at the gateway can never be retrieved by the user (at least in this version of each product). We estimate that a false positive rate of 1.0% is really an absolute maximum, and numbers closer to 0.5% or 0.1% would be more reasonable.
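As a quick check of the arithmetic, the MailFrontier false positive rate can be reproduced in a few lines. The function name is hypothetical; the figures (32 false positives, 3,090 messages) come from the text. Note that the denominator here is all mail received, not just the non-spam mail.

```python
# Minimal sketch of the false positive rate as defined in this article:
# non-spam messages marked as spam, divided by ALL messages received.

def false_positive_rate(false_positives, total_messages):
    """Fraction of all received mail that was wrongly marked as spam."""
    return false_positives / total_messages


if __name__ == "__main__":
    rate = false_positive_rate(32, 3090)
    print(f"{rate:.1%}")  # prints 1.0%
```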

We also calculated false negatives: messages that are spam, but were not marked as such. Although everyone wants to reduce false negatives, some are inevitable in any system such as this. We thought that a false negative rate in the range of 10% to 20% would be acceptable, although the lower, the better.

We kept track of “maybe” numbers as well, both for false positives and false negatives, representing the overhead of an additional folder that the end user would have to periodically scan. In both products, there is a band of uncertainty. For example, with MailFrontier, messages marked as “maybe spam” would fall into this category. The goal is to keep this folder as small as possible, reducing the end user’s burden of looking at and disposing of messages. If the “maybe” folder got too big with too much non-spam in it, then the products would not be doing their jobs very well.

We then calculated an overall “spam reduction” value, or in other words the spam that we didn’t have to look at. We determined that value by taking the total spam in our sample (1,530) and subtracting the false negatives (because we would have had to look at them immediately) and the false “maybes” (because we would have had to look at them eventually). To simplify things, we expressed spam reduction as a percentage of the total spam. Thus, MailFrontier had a spam reduction rate of 86.1%, because it had 130 false negatives and 82 “maybes” that were really spam. MailFrontier properly marked the remaining 1,318 of the 1,530 spam messages as “spam”. Thus, 1,318 (1,530-130-82) divided by 1,530 gives 86.1%.
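The spam reduction figure can be reproduced the same way: the spam the user never had to look at (total spam minus false negatives minus spam that landed in the "maybe" folder), divided by the total spam. The function name is hypothetical; the figures are MailFrontier's from the text.

```python
# Minimal sketch of the "spam reduction" percentage described above.

def spam_reduction(total_spam, false_negatives, spam_in_maybe):
    """Fraction of spam that was quarantined and never seen by the user."""
    never_seen = total_spam - false_negatives - spam_in_maybe
    return never_seen / total_spam


if __name__ == "__main__":
    # 1,530 spam messages, 130 false negatives, 82 spam in "maybe"
    print(f"{spam_reduction(1530, 130, 82):.1%}")  # prints 86.1%
```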