How we did it
How we tested the 16 spam filters.
We evaluated enterprise-level anti-spam products by installing them in a production e-mail environment for approximately one month. Most products were installed on servers at our test lab; the four off-site anti-spam services were tested over a 24M bit/sec Internet connection. To accommodate simultaneous testing, we used a dual-processor Intel server with 1.4-GHz CPUs and 3G bytes of memory running VMware's GSX Server to create a separate virtual machine for each product. Two products (Tumbleweed and Corvigo) ship as appliances and came with their own servers; two others (MailFrontier and SurfControl) ran on other systems because deadlines did not leave us time to install them on our own hardware.
In the first part of the testing, we evaluated how well each product does one important job: filtering spam. We took the incoming mail stream of the tester's firm, Opus One, and simultaneously re-fed it to each of the test systems in very close to real time. This meant that each product was seeing the same spam more or less simultaneously and, more importantly, was seeing it as it flowed into our network from the Internet. Sending canned (old) spam would have been far easier, but it would have made for a poor test, because old spam is much easier to filter than new spam. Each product was connected to the Internet and received spam signature updates as often as its vendor recommended.
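As a rough illustration only (this is not the harness we actually ran), the fan-out idea fits in a few lines of Python; the host names and the message passed in are hypothetical:

    import smtplib
    from email.message import EmailMessage

    # One entry per product under test; these host names are made up.
    TEST_SYSTEMS = ["filter-a.lab.example.com", "filter-b.lab.example.com"]

    def fan_out(msg: EmailMessage) -> None:
        # Re-deliver an identical copy of one incoming message to every test system,
        # as close to real time as possible.
        for host in TEST_SYSTEMS:
            try:
                with smtplib.SMTP(host, 25, timeout=30) as smtp:
                    smtp.send_message(msg)
            except (smtplib.SMTPException, OSError) as err:
                # A failure at one product should not block delivery to the others.
                print(f"delivery to {host} failed: {err}")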
Our goal was to collect approximately 10,000 messages for a good statistical sample, so we let this test run for most of June. Our actual total was more than 12,000 messages, which we trimmed to 11,324 so that the sample began and ended on midnight boundaries.
We then looked at every single one of those messages and sorted each into one of three categories: spam, not spam, and don't know. We defined as spam the 7,840 messages for which there was no conceivable business or personal relationship between sender and receiver and which were obviously bulk in nature. In the not spam category were 3,468 messages that may or may not have been solicited: they either reflected a clear business or personal relationship between sender and receiver, or were obviously one-to-one messages, even if unsolicited and unwanted. All mailing lists with legitimate subscriptions were considered not spam, and we made no mailing-list changes during the test period.
In the don't know category were only 16 messages we couldn't classify: messages that looked like spam but could have been the result of a legitimate business connection. For example, a few of these were press releases for products that weren't relevant to Opus One's business but might have come from an overzealous PR agency that Opus One communicates with.
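For readers checking the arithmetic, the three categories reconcile with the totals reported here; a quick sanity check in Python:

    # Hand-sorted category counts as reported above.
    spam, not_spam, dont_know = 7_840, 3_468, 16
    assert spam + not_spam + dont_know == 11_324   # trimmed sample
    assert spam + not_spam == 11_308               # final count after dropping "don't know"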
We deleted the don't know messages from our test sample, leaving a final count of 11,308. Then we compared the "correct" results with the actual results for each product and came up with two scores: a false-positive rate and a sensitivity level. In a perfect world, the false-positive rate of any product would be zero: it measures the rate at which a product marks non-spam messages as spam. Similarly, the ideal sensitivity level would be 100%: all spam messages properly marked as spam and diverted away from the end user. Because a corporate mail system would likely be far more sensitive to false positives, we asked each vendor for advice on tuning its product to reach a false-positive rate of less than 1% (see "Spam and Statistics").
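Both scores reduce to simple ratios. A minimal Python sketch of the calculation follows; the per-product counts here are invented purely for illustration:

    def score(spam_caught, spam_total, legit_flagged, legit_total):
        false_positive_rate = legit_flagged / legit_total   # legitimate mail wrongly marked as spam
        sensitivity = spam_caught / spam_total              # spam correctly marked as spam
        return false_positive_rate, sensitivity

    # Example against the 11,308-message sample (7,840 spam, 3,468 legitimate);
    # the per-product results below are made up.
    fp_rate, sens = score(spam_caught=7_300, spam_total=7_840,
                          legit_flagged=25, legit_total=3_468)
    print(f"false-positive rate: {fp_rate:.2%}   sensitivity: {sens:.2%}")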
The second part was performance testing. We combined Spirent's WebAvalanche testing tool with an e-mail database of 10,000 messages from the Internet Mail Consortium. Spirent made a new version of its software available to us that let us supply message content we controlled, and we happily uploaded our message database to the WebAvalanche. At the start of the test we asked vendors to size their systems for 10 messages per second (approximately 1 million messages per day), so we dialed the WebAvalanche traffic generator up to 20 messages per second to put stress on the systems. Because we wanted to measure how much the spam-evaluation function slowed down each system, we dumped the messages in as quickly as the systems would take them (up to 20 messages per second), streaming messages over 10 simultaneous connections to each system. We then timed how long it took each system to process and return the 10,000 messages.
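A rough sketch of the arithmetic behind those numbers; the elapsed time shown is invented for illustration:

    MESSAGES = 10_000
    sizing_rate = 10                        # messages/sec vendors were asked to size for
    print(f"{sizing_rate} msgs/sec is roughly {sizing_rate * 86_400:,} messages per day")

    # If a lab system took, say, 1,250 seconds to accept, process and return
    # the full pile, its effective rate would be:
    elapsed_seconds = 1_250                 # hypothetical elapsed time
    print(f"effective rate: {MESSAGES / elapsed_seconds:.1f} messages/sec")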
For systems in the lab (all but the services), we reported how quickly each system could accept our pile of messages, process them and return them. For the services, we felt that reporting an absolute speed would be more a measure of how fast the Internet connection between Opus One and each service was than of how fast the service itself was (accept rates on all the services were between 3.6 and 6 messages per second). For the services, then, we showed only the relative slowdown on messages coming back. Our assumption was that the Internet would cancel out in both directions, so a service that accepted messages at 4.4 messages per second, for example, should be able to send them back at the same rate. Anything less would indicate a slowdown somewhere in the service worth reporting.
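As a sketch of that relative-slowdown comparison (both rates below are example figures, not measured results):

    accept_rate = 4.4   # messages/sec the service accepted from us
    return_rate = 3.9   # messages/sec it delivered back (made up for illustration)

    # If the Internet path affects both directions equally, the return rate should
    # match the accept rate; anything lower points to a slowdown inside the service.
    slowdown = 1 - return_rate / accept_rate
    print(f"relative slowdown: {slowdown:.0%}")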
Our final part of the testing was subjective: how well the products met the needs of an enterprise-level mail system. Features such as whitelisting, per-user tuning capabilities, variety of thresholds and management interfaces were all judged on how much they would benefit an enterprise.