How we did it
By Joel Snyder, Network World, 09/15/2003
We evaluated enterprise-level anti-spam products by installing them in a production e-mail environment for approximately one
month. Most products were installed on servers at our test lab; the four off-site anti-spam services were tested using a 24M
bit/sec Internet connection. To accommodate simultaneous testing, a dual-processor Intel server with 1.4-GHz CPUs and 3G-bytes
of memory was used with VMware's GSX Server to create a different virtual machine for each product. Two products (Tumbleweed and Corvigo) ship on appliances and came with their own servers; two others (MailFrontier and SurfControl) ran on other systems because deadlines did not allow us the time to install them on our own hardware.
In the first part of the testing, we evaluated how each product does one important job: filtering spam. We took the tester's
Opus One incoming mail stream and simultaneously re-fed it to each of the test systems in very close to real time. This meant
that each product was seeing the same spam more-or-less simultaneously, and, more importantly, was seeing it as it flowed
into our network from the Internet. Sending canned (old) spam would have been a lot easier, but would have been a poor test
because it's a lot easier to filter out old spam than new spam. Each product was connected to the Internet and got spam signature
updates as often as the vendor recommended.
Our goal was to get approximately 10,000 messages for a good statistical sampling, so we let this test run for most of June.
Our actual total was more than 12,000 messages, which we reduced to 11,324 messages to fall on midnight boundaries.
We then looked at every single one of those messages and placed each into one of three categories: spam, not spam, and don't
know. We defined as spam the 7,840 messages for which there was no conceivable business or personal relationship between sender
and receiver, and which were obviously bulk in nature. In the not spam category were 3,648 mail messages that may or may not
have been solicited - they either had a clear business or personal relationship between sender and receiver, or were obviously
one-to-one messages, even if unsolicited and unwanted. All mailing lists that had legitimate subscriptions were considered
not spam. We didn't make any mailing list changes during the test duration.
In the don't know category were only 16 messages that we couldn't figure out - messages that looked like spam, but could have
been the result of a legitimate business connection. For example, a few of these messages were press releases for products
that weren't relevant to Opus One's business, but might have originated from an overzealous PR agency that Opus One communicates
with.
We deleted the don't know messages from our test sample, leaving us a final count of 11,308. Then we compared the "correct"
results with actual results for each product and came up with two scores: a false-positive rate and a sensitivity level. In
a perfect world, the false-positive rate of any product would be zero: It measures the level at which one of these products
marks non-spam messages as spam. Similarly, the ideal sensitivity level of each product would be 100%: All spam messages properly
marked as spam and diverted away from the end user. Because a corporate mail system would likely be much more sensitive to
false positives, we asked each vendor for advice in tuning its product to reach a false-positive rate of less than 1%. (See Spam and Statistics.)
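As a rough illustration of the two scores, here is a minimal Python sketch. It assumes the false-positive rate is the fraction of hand-labeled non-spam messages that a product flagged, and sensitivity is the fraction of hand-labeled spam that it flagged; the function name, the data layout and the toy sample are all hypothetical, not taken from the test.

```python
from typing import Iterable, Tuple


def score_filter(verdicts: Iterable[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Return (false_positive_rate, sensitivity) as fractions between 0 and 1.

    Each verdict is a pair (is_spam, flagged): the hand-assigned label and
    whether the product under test marked the message as spam.
    Assumption: the false-positive rate is taken over non-spam messages only.
    """
    spam = flagged_spam = ham = flagged_ham = 0
    for is_spam, flagged in verdicts:
        if is_spam:
            spam += 1
            flagged_spam += int(flagged)
        else:
            ham += 1
            flagged_ham += int(flagged)
    if spam == 0 or ham == 0:
        raise ValueError("need at least one spam and one non-spam message")
    return flagged_ham / ham, flagged_spam / spam


# Toy sample (made up): 4 spam messages, 3 of them caught;
# 4 legitimate messages, 1 of them wrongly flagged.
sample = [
    (True, True), (True, True), (True, True), (True, False),
    (False, False), (False, False), (False, False), (False, True),
]
fp_rate, sensitivity = score_filter(sample)
print(f"false-positive rate: {fp_rate:.1%}, sensitivity: {sensitivity:.1%}")
```

Messages in the don't know category would simply be left out of the input, as they were removed from the sample before scoring.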
The second part was performance testing. We combined Spirent's WebAvalanche testing tool with an e-mail database of 10,000 messages from the Internet Mail Consortium. Spirent made a new version of its software available to us, which let us provide message content that we controlled, and
we happily uploaded our message database to the WebAvalanche. We asked vendors at the start of the test to size their systems
for 10 messages per second (approximately 1 million messages per day), so we dialed the WebAvalanche traffic generator up
to 20 messages per second to put stress on the systems. Because we wanted to test how much the spam evaluation function slowed
down each system, we dumped the messages in as quickly as the systems would take them (up to 20 messages per second), streaming
messages through 10 simultaneous connections to each system. We then timed how long it took each system to process and return
the 10,000 messages.
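The load figures above reduce to simple arithmetic, sketched below in Python. A sustained 10 messages per second works out to 864,000 messages per day, which is the basis for the "approximately 1 million" figure, and a 20-message-per-second ceiling means a 10,000-message run cannot finish in under 500 seconds. The function names are hypothetical and no measured results are implied.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def messages_per_day(rate_per_sec: float) -> float:
    """Daily volume implied by a sustained per-second message rate."""
    return rate_per_sec * SECONDS_PER_DAY


def min_run_seconds(message_count: int, max_rate_per_sec: float) -> float:
    """Shortest possible run time when the generator caps the send rate."""
    return message_count / max_rate_per_sec


def observed_rate(message_count: int, elapsed_seconds: float) -> float:
    """Average throughput, in messages per second, over a timed run."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return message_count / elapsed_seconds


print(messages_per_day(10))         # sizing target requested from vendors
print(min_run_seconds(10_000, 20))  # floor on run time at the stress rate
```

A system that took longer than the 500-second floor to return all 10,000 messages was running below the offered load; `observed_rate` turns that elapsed time back into a messages-per-second figure.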
For systems in the lab (all but the services), we reported how quickly the systems could accept our pile of messages, process
them, and return them. For the services, we felt that reporting an absolute speed was going to be more a measure of how fast
the Internet was between Opus One and each service, rather than how fast the service was (accept rates on all the services
were between 3.6 and 6 messages per second). For the services, then, we only showed the relative slowdown on messages coming
back. Our guess was that the Internet would cancel out in both directions, so a service that accepted at 4.4 messages per second,
for example, should be able to send them back at the same rate. Anything less would be indicative of a slowdown in the system
somewhere that would be worth reporting.
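One way to express that relative slowdown is as the fraction by which the return rate falls short of the accept rate. The article does not give a formula, so the definition in this Python sketch is an assumption, and the function name and the second set of rates are hypothetical.

```python
def relative_slowdown(accept_rate: float, return_rate: float) -> float:
    """Fraction by which the return rate falls short of the accept rate.

    0.0 means messages came back as fast as they were accepted. A negative
    value would mean they came back faster than they went in.
    Assumed definition: 1 - return_rate / accept_rate.
    """
    if accept_rate <= 0:
        raise ValueError("accept_rate must be positive")
    return 1.0 - return_rate / accept_rate


# The article's example: accepted at 4.4 messages per second and returned
# at the same rate, so there is no slowdown to report.
print(f"{relative_slowdown(4.4, 4.4):.0%}")

# Hypothetical service: accepts at 4.0 but returns at 3.0 messages per second.
print(f"{relative_slowdown(4.0, 3.0):.0%}")
```

Dividing by the accept rate keeps the comparison independent of how fast the link to each service happened to be, which is the point of reporting a relative figure rather than an absolute speed.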
Our final part of testing was subjective - how well the products met the needs of an enterprise-level mail system. Features
such as whitelisting, per-user tuning capabilities, variety of thresholds and management interfaces were all judged on how
well an enterprise would benefit.