Outwitting spammers

Clever, self-learning methods such as Bayesian filters and neural networks are the latest anti-spam buzz. But some fall frustratingly short for corporate use.

Spam, the four-letter word of the virtual world, seems to be on everyone's lips these days, from rank-and-file workers to Congress and the British Parliament. The buzz occurs with good reason: Spam has leapt from a minor, annoying byproduct of e-mail to an epic business problem. Unsolicited e-mails are growing at a rate of 5% per month, according to a Kessler International survey. That means thousands of unwanted e-mails per week, often totaling 75% of the messages an enterprise e-mail gateway must process - while clogging downstream wires and servers, users say.

Spam on that scale also sucks up employee time; Nucleus Research reports that nuisance e-mail costs $874 per person annually in lost productivity. And with some messages so obscene as to make a merchant marine blush, much of the spam content is inappropriate for a business environment, if not outright illegal.

Government intervention has been discussed considerably as a solution, but network professionals aren't holding out for relief from legislation. Its effectiveness will be iffy at best. For the time being, exterminating the spam menace will remain the task of the network team. That's easier said than done. E-mail marketers constantly find ways to thwart existing e-mail filters. Anti-spam software vendors, in turn, create new filters intended to spot spammers' latest tricks. Not only must network executives frequently update software to get the latest filters, but the more filtering they switch on, the higher the chance that legitimate e-mail gets mislabeled and deleted as spam.

The latest crop of filters promises to stop this yo-yo cycle. These filters are based on "self-learning" or "machine-learning" technologies that attempt to adapt automatically to spammers' new tricks while protecting legitimate e-mail. Among machine-learning technologies in commercial spam filters, Bayesian filtering and neural networks are the most talked about, with Bayesian filtering generating a downright roar. In the past few months, this type of filter has been implemented in a growing number of anti-spam products, ranging from open source product SpamAssassin to an enterprise-class spam-detection module from start-up ProofPoint.

Users who have tried Bayesian filtering recommend it.

"I implemented SpamAssassin before Bayesian was part of it, and it worked pretty well, but with Bayesian, it works much better. Bayesian is essential in this day and age," says John Stewart, senior technical specialist for Artesyn Communication Products, in Madison, Wis.

But, unlike more established anti-spam technologies like dictionary scans, blacklisting and heuristics, today's buzzy machine-learning filters are not always a straightforward affair for a corporation. (See The Anti-Spam Glossary.)

Spam or not-spam

Bayesian filters are based on an algorithm for classifying documents, says Paul Graham, an independent programmer who created an early, open source Bayesian spam filter. Because all spammers must somehow state their message - despite any tricks they use to fake out filters - Bayesian techniques for intelligently classifying content have proven effective.

First, the user divides e-mail into two piles, spam and not-spam, from which the filter trains itself. The filter analyzes every word in each e-mail and determines how frequently the word occurs in the spam pile vs. the not-spam pile. For instance, when the filter finds "V1AGRA" in spam but never in not-spam, V1AGRA earns a 100% probability of being a spam "word." Because "the" occurs equally in spam and not-spam, it gains a neutral 50%. An innocent word such as RFP, meanwhile, would occur in not-spam but rarely (if ever) in spam, giving it, for instance, a 99% probability of being a not-spam word.
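The training step described above can be sketched in a few lines of Python. This is a minimal illustration, not the code of any product named here: the function names, the sample piles and the tokenizing rule are all assumptions, and the sketch caps probabilities at 1% and 99% (a common practical choice) so that no single word counts as absolute proof.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word-like tokens."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())

def train(spam_msgs, ham_msgs):
    """Return {word: estimated probability that the word marks spam}."""
    # Count each word once per message, in each pile.
    spam_counts = Counter(w for m in spam_msgs for w in set(tokenize(m)))
    ham_counts = Counter(w for m in ham_msgs for w in set(tokenize(m)))
    probs = {}
    for word in spam_counts.keys() | ham_counts.keys():
        spam_rate = spam_counts[word] / len(spam_msgs)
        ham_rate = ham_counts[word] / len(ham_msgs)
        p = spam_rate / (spam_rate + ham_rate)
        # Assumption: clamp to [0.01, 0.99] rather than allow 0% or 100%.
        probs[word] = min(0.99, max(0.01, p))
    return probs

# Hypothetical training piles, invented for this example.
spam_pile = ["Buy V1AGRA now, the best price", "V1AGRA for the weekend"]
ham_pile = ["Please review the RFP draft", "The RFP is due Friday"]
probs = train(spam_pile, ham_pile)
```

With these toy piles, "v1agra" lands at the spam ceiling, "the" sits at the neutral midpoint and "rfp" lands at the not-spam floor.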

When an e-mail arrives at the trained Bayesian filter, the filter looks for the 15 words whose probabilities lie farthest from the neutral 50% - "either very guilty or very innocent," Graham says - and uses them to calculate the message's overall spam probability. Tricks like replacing the "i" with the numeral 1 in VIAGRA might confound a simple dictionary filter, but they help the Bayesian filter.
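One way to picture the scoring step is the sketch below. It assumes a table of per-word probabilities already exists, picks the words farthest from the neutral 0.5, and combines them with the standard naive Bayes formula. The function name, the sample table and the 0.4 default for never-seen words are illustrative assumptions, not details confirmed by any vendor in this story.

```python
from math import prod

def spam_probability(words, probs, top_n=15, unknown=0.4):
    """Combine the most telling word probabilities into one message score."""
    # One probability per distinct word; unseen words get a mild default.
    scores = [probs.get(w, unknown) for w in set(words)]
    # "Very guilty or very innocent" means farthest from the neutral 0.5.
    scores.sort(key=lambda p: abs(p - 0.5), reverse=True)
    top = scores[:top_n]
    guilty = prod(top)                   # evidence the message is spam
    innocent = prod(1 - p for p in top)  # evidence the message is legitimate
    return guilty / (guilty + innocent)

# Hypothetical word table, invented for this example.
word_probs = {"v1agra": 0.99, "price": 0.90, "the": 0.50, "rfp": 0.01}
spam_score = spam_probability(["v1agra", "price", "the"], word_probs)
ham_score = spam_probability(["rfp", "the"], word_probs)
```

A neutral word such as "the" contributes equally to both products and so has no effect on the result, which is why only the extreme words matter.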

"If spammers say V1agra . . . my God is that guilty - even more than Viagra. Conceivably, people might be writing each other e-mails about Viagra, but they are not going to be writing about V-1-a-g-r-a," Graham says.

In most products, Bayesian filtering is only one of several tests in a heuristics process that determines the e-mail's overall spam probability. Once that's done, the anti-spam tool embeds a spam rating in the message header and then typically sends the e-mail on to the client's e-mail software, which uses the tag to sort and/or delete the message, per user instructions.
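The tagging step might look something like the following sketch, which sums the points awarded by several tests and writes the result into the message header. The test names, point values, threshold and header names are all hypothetical, chosen only to resemble the kind of tags such tools add. The sketch uses Python's standard `email` package.

```python
from email import message_from_string

def tag_message(raw_message, test_scores, threshold=5.0):
    """Add illustrative spam-rating headers based on the summed test scores."""
    msg = message_from_string(raw_message)
    total = sum(test_scores.values())
    msg["X-Spam-Score"] = f"{total:.1f}"
    msg["X-Spam-Flag"] = "YES" if total >= threshold else "NO"
    # One asterisk per whole point, so mail clients can match on a prefix.
    msg["X-Spam-Level"] = "*" * max(0, int(total))
    return msg

# Hypothetical message and test results, invented for this example.
raw = "From: sender@example.com\nSubject: hello\n\nMessage body\n"
tagged = tag_message(raw, {"bayes": 3.5, "html_only": 1.2, "forged_header": 2.0})
```

Because the rating travels inside the message, the gateway never has to decide what happens to the mail; that choice stays with whatever reads the header downstream.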

Should the filter err, calling a legitimate e-mail spam or questionable (the false positive) or tagging spam as legitimate (the false negative), the end user moves the falsely labeled message to the correct folder. The filter uses these folders to retrain itself daily, or at a user-specified frequency. Regular training ensures that the filter automatically learns the latest spammer tricks (such as garbage characters in the subject line and spaces between letters). Filtering is also personalized. The banker can accept mortgage offers from the competition as legitimate - while the office manager deletes them as spam.
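To see why per-user training personalizes the results, consider this rough sketch, in which each user has a separate filter that learns from that user's own folders. The class, its methods and the sample messages are assumptions made for illustration only.

```python
from collections import Counter

class UserFilter:
    """Per-user word statistics, updated whenever a message is filed."""

    def __init__(self):
        self.spam_words = Counter()
        self.ham_words = Counter()
        self.spam_total = 0
        self.ham_total = 0

    def learn(self, text, is_spam):
        """Record a message the user has filed as spam or as legitimate."""
        words = set(text.lower().split())
        if is_spam:
            self.spam_words.update(words)
            self.spam_total += 1
        else:
            self.ham_words.update(words)
            self.ham_total += 1

    def word_probability(self, word):
        """Estimate how strongly a word signals spam for this user."""
        spam_rate = self.spam_words[word] / max(self.spam_total, 1)
        ham_rate = self.ham_words[word] / max(self.ham_total, 1)
        if spam_rate + ham_rate == 0:
            return 0.5  # never seen: treat as neutral
        return spam_rate / (spam_rate + ham_rate)

# The same mortgage offer, filed differently by two hypothetical users.
offer = "new mortgage rates from a rival bank"
banker, manager = UserFilter(), UserFilter()
banker.learn(offer, is_spam=False)
banker.learn("cheap pills online", is_spam=True)
manager.learn(offer, is_spam=True)
manager.learn("staff meeting moved to friday", is_spam=False)
```

After training, "mortgage" counts as innocent for the banker and guilty for the office manager, even though both saw the identical message.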

Because Bayesian filtering was designed for the client, it is most commonly a feature of consumer products, with a price of about $30 per license. Vendors, no doubt, would be willing to negotiate less-expensive, volume prices for enterprise deployments. Many Bayesian filters also are available for free as open source tools.

Clogging the gateway

While Bayesian filtering's client-side bent is an elegant way to stop spam from sponging up productivity, its drawback is that it introduces major client-management headaches when used on an enterprise scale. That calls for some creativity.

For instance, Artesyn's Stewart customized SpamAssassin so that it will filter e-mail at the mail gateway, not the client. Typically, SpamAssassin is a Unix product intended to run after the mail is delivered - with a Unix e-mail client or on the box that holds the Post Office Protocol and Internet Message Access Protocol accounts, Stewart says. He modified SpamAssassin so that mail can flow through it. SpamAssassin runs on a server he calls the "Spaminator," along with amavisd-new, an open source virus-scanning interface that includes a call to SpamAssassin; and Postfix, a Unix mail program. Custom Perl scripts handle various e-mail management tasks.

When an e-mail arrives, it is scanned for viruses and trotted through SpamAssassin's heuristics tests, which include Bayesian filtering (in versions 2.5 or later). SpamAssassin tags the e-mail with a spam level and sends it on to the company's Microsoft Exchange servers, which deliver it into Outlook clients. If configured to do so, Outlook sorts the messages into various folders according to their spam level.

Stewart lets employees opt into the anti-spam program. On the intranet, he posted detailed instructions on setting up Outlook in-box rules to sort mail using the spam-level tags. If employees find mislabeled spam, Stewart asks that they forward it to a special folder for use in retraining the Bayesian filter.
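The sorting logic behind such in-box rules is simple enough to sketch. The example below assumes the gateway writes a spam-level header made of asterisks, one per point; the folder names and cutoffs are hypothetical, and real Outlook rules are set up through the mail client rather than in code.

```python
from email import message_from_string

def folder_for(raw_message, probable=5, certain=10):
    """Pick a destination folder from the number of spam-level asterisks."""
    msg = message_from_string(raw_message)
    stars = (msg.get("X-Spam-Level") or "").count("*")
    if stars >= certain:
        return "Spam"
    if stars >= probable:
        return "Possible Spam"
    return "Inbox"

# Hypothetical tagged messages, invented for this example.
borderline = "Subject: offer\nX-Spam-Level: *******\n\nbody\n"
blatant = "Subject: offer\nX-Spam-Level: ************\n\nbody\n"
untagged = "Subject: lunch\n\nbody\n"
```

A middle folder for borderline mail gives users a place to check for false positives without wading through the worst of the spam.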

Anti-spam and the feds

The government wants to play a role in curbing the spam problem.
Surely, the various legislative proposals to control spam now circulating in Congress won't end unwanted e-mail. But, once passed, they could be of help.

Management of Outlook clients has been a non-issue, he says: "Eighty-plus percent of the people who do this figured it out without any help from IT. If someone does have a problem, it's not that big of a deal to get IT staff out there to check out their rules."

So far, about 100 of Artesyn's 250 employees participate in the anti-spam program; many of these participants are engineers whose e-mail addresses are widely available from Usenet postings. Since he began filtering spam, Stewart has discovered that Artesyn is receiving between 3,000 and 4,000 unwanted e-mails a day, compared with 750 legitimate messages a week. SpamAssassin "literally changed my life," Stewart says, noting that he is no longer hobbled by the constant e-mail checks required when spam was pouring into his in-box. "There's just no other way to deal with spam than to have a filter."

By using open source code, putting a spare PC to use as the Spaminator server and exerting a little programming elbow grease, "Our total investment in this was zero - it was just my time," Stewart says.

Neural networks

Artesyn's creative adaptation eliminates the need to install and manage filtering software on clients. But it still ultimately deals with the spam at the client and so, like most Bayesian filtering products, does nothing to ease the stress all that unwanted e-mail places on network processors.

Artesyn's network can handle the load, so Stewart favors this approach as an insurance policy that legitimate mail isn't deleted as spam. "It's bothersome, but we've got the capacity to handle it. A false positive is much worse than a false negative. Sales doesn't want to lose a possible sale," he says. "If this was my personal mail, I would certainly just drop it at the gateway. But as a business, we need to make sure we're not getting rid of legitimate e-mail."

What's needed to ease that load, some vendors say, are Bayesian-like technologies for products that operate at the e-mail gateway. Joshua Elicio, information security officer for Memorial Health Center in Las Cruces, N.M., says stopping spam at the gateway is critically important. He doesn't want to drop legitimate e-mail, but doesn't want spam clogging up the network plumbing, either.

Anti-spam glossary

Network World’s anti-spam fighter, Peter Hebenstreit, recommends familiarity with these methods of stopping unwanted e-mail.
Attachment checking — Checks for macros and text in attachments.
Blacklists — Let users designate a source or IP address from which no mail will be accepted.
Code checking — Looks for “open new window” or any other type of scripting that might be malicious.
Complex dictionary checking — Screens text for no-no words and won’t be fooled by various tricks, such as the replacement of letters with look-alike numerals (c001 = cool).
Content checking — Scans plain text for key phrases and the percent of HTML, images and other indications that the message is spam.
Header checking — Checks for valid Multi-purpose Internet Mail Extensions content, valid Simple Mail Transfer Protocol addresses and the like.
Heuristics — A checklist of rules and tests to determine mathematically the likelihood that the message is spam.
Machine-learning or self-learning — Methods, such as Bayesian filtering, by which the filter can create and update its own heuristics checklist.
Snapshotting or fingerprinting — Identifies that similar, yet not identical, messages are part of the same, already-identified spam broadcast.
Whitelists — Let users designate a source or IP address from which all mail will be accepted, even if individual messages earn high spam ratings.
— Julie Bort

In March 2002, Elicio activated anti-spam filters in the anti-virus software he used, SurfControl's E-mail Filter, to stop spam at the gateway. Of the 4,600 e-mails the hospital received per week, he saw that only about 750 were legitimate. Elicio analyzed those e-mails and decided to take the more drastic measure of blocking e-mail generated from suspect origins. "We're a hospital in southern New Mexico. We don't deal a lot with Czechoslovakian companies," he quips. With that, plus other SurfControl filters, Elicio has reduced spam to 300 messages a day. "Spam doesn't get to the in-box," he says.

The downside is that the more aggressively a network executive blocks spam, and the more "intelligent" and automated the process, the higher the risk of false positives. But Symantec and other vendors using neural networks say that technology applies self-learning in a safer and more appropriate way than Bayesian filtering could at the gateway.

A neural network, based on artificial intelligence algorithms, is similar to Bayesian filtering in that the software trains itself to recognize new spam. But the software for training resides at the vendors' sites, not on users' clients. The batch of e-mail by which neural networks train comes from thousands of phony e-mail in-boxes the vendors set up to collect spam. With millions of spam messages to examine hourly, the machine-learning software is constantly on the cutting edge of spammers' tricks, vendors say.

But, similar to anti-virus software, products using neural networks require users to update their gateway software at regular intervals, usually once a day. Messages identified as spam at the gateway can be lodged in a spam folder for administrator review. Or, if the message's spam rating is high enough (while the company's concern over false positives is low enough) the message can be deleted outright.
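The trade-off between quarantining and deleting comes down to a pair of thresholds, as in this sketch. The function name, the 0-to-1 rating scale and the cutoff values are assumptions for illustration; leaving the delete threshold unset models a company that never wants mail dropped without review.

```python
def gateway_action(spam_rating, quarantine_at=0.80, delete_at=None):
    """Decide what the gateway does with a message rated from 0.0 to 1.0."""
    # Outright deletion is opt-in: only used when a threshold is configured.
    if delete_at is not None and spam_rating >= delete_at:
        return "delete"
    if spam_rating >= quarantine_at:
        return "quarantine"  # held in a spam folder for administrator review
    return "deliver"

# A cautious site quarantines only; an aggressive one also deletes.
cautious = gateway_action(0.99)
aggressive = gateway_action(0.99, delete_at=0.98)
```

Raising `delete_at` toward 1.0, or leaving it unset, trades network and review workload for a lower chance of losing a legitimate message.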
