Network World - Spam, the four-letter word of the virtual world, seems to be on everyone's lips these days, from rank-and-file workers to Congress and the British Parliament. The buzz occurs with good reason: Spam has leapt from a minor, annoying byproduct of e-mail to an epic business problem. Unsolicited e-mails are growing at a rate of 5% per month, according to a Kessler International survey. That means thousands of unwanted e-mails per week, often totaling 75% of the messages an enterprise e-mail gateway must process - while clogging downstream wires and servers, users say.
Spam on that scale also sucks up employee time; Nucleus Research reports that nuisance e-mail costs $874 per person annually in lost productivity. And with some messages so obscene as to make a merchant marine blush, much of the spam content is inappropriate for a business environment, if not outright illegal.
Government intervention has been discussed considerably as a solution, but network professionals aren't holding out for relief from legislation. Its effectiveness will be iffy at best. For the time being, exterminating the spam menace will remain the task of the network team. That's easier said than done. E-mail marketers constantly find ways to thwart existing e-mail filters. Anti-spam software vendors, in turn, create new filters intended to spot spammers' latest tricks. Not only must network executives frequently update software to get the latest filters, but the more filtering they switch on, the higher the chance that legitimate e-mail gets mislabeled and deleted as spam.
The latest crop of filters promises to stop this yo-yo cycle. These filters are based on "self-learning" or "machine-learning" technologies that attempt to adapt automatically to spammers' new tricks while protecting legitimate e-mail. Among machine-learning technologies in commercial spam filters, Bayesian filtering and neural networks are the most talked about, with Bayesian filtering generating a downright roar. In the past few months, this type of filter has been implemented in a growing number of anti-spam products, ranging from open source product SpamAssassin to an enterprise-class spam-detection module from start-up ProofPoint.
Users who have tried Bayesian filtering recommend it.
"I implemented SpamAssassin before Bayesian was part of it, and it worked pretty well, but with Bayesian, it works much better. Bayesian is essential in this day and age," says John Stewart, senior technical specialist for Artesyn Communication Products, in Madison, Wis.
But unlike more established anti-spam technologies such as dictionary scans, blacklisting and heuristics, today's buzzy machine-learning filters are not always a straightforward affair for a corporation.
Bayesian filters are based on an algorithm for classifying documents, says Paul Graham, an independent programmer who created an early, open source Bayesian spam filter. Because all spammers must somehow state their message - despite any tricks they use to fake out filters - Bayesian's techniques for intelligently classifying content have proven effective.
First, the user divides e-mail into two piles, spam and not-spam, from which the filter trains itself. The filter analyzes every word in each e-mail and determines how frequently the word occurs in the spam pile vs. the not-spam pile. For instance, when the filter finds "V1AGRA" in spam but never in not-spam, that token earns a 100% probability of being a spam "word." Because "the" occurs equally in spam and not-spam, it gains a neutral 50%. An innocent word such as RFP would occur in not-spam but rarely (if ever) in spam, giving it, for instance, a 99% probability of being a not-spam word.
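The training step described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the function names `tokenize` and `train`, the whitespace tokenizer, and the choice to count each word once per message are all assumptions made for the example.

```python
from collections import Counter

def tokenize(text):
    # Assumption: lowercase, whitespace-separated tokens, counted once per message.
    return set(text.lower().split())

def train(spam_msgs, ham_msgs):
    """Return a dict mapping each word to its estimated probability of being a spam word."""
    spam_counts = Counter(w for m in spam_msgs for w in tokenize(m))
    ham_counts = Counter(w for m in ham_msgs for w in tokenize(m))
    probs = {}
    for word in set(spam_counts) | set(ham_counts):
        # Fraction of messages in each pile that contain the word.
        spam_rate = spam_counts[word] / len(spam_msgs)
        ham_rate = ham_counts[word] / len(ham_msgs)
        # At least one rate is non-zero, so the denominator is never zero.
        probs[word] = spam_rate / (spam_rate + ham_rate)
    return probs

spam_pile = ["buy v1agra now the best", "the v1agra deal"]
ham_pile = ["the rfp is due", "please review the rfp"]
word_probs = train(spam_pile, ham_pile)
```

With these toy piles, "v1agra" scores 1.0, "the" scores 0.5 and "rfp" scores 0.0 (that is, 100% not-spam). A practical filter would likely clamp scores slightly away from 0 and 1 so that a single word cannot decide the outcome on its own.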
When an e-mail arrives at the trained Bayesian filter, the filter looks for the 15 words with the highest probabilities - "either very guilty or very innocent," Graham says - and uses them to calculate the message's overall spam probability. Tricks like replacing the "i" with the numeral 1 in VIAGRA might confound a simple dictionary filter, but they help the Bayesian filter.
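The scoring step can be sketched as follows. The article does not spell out the formula, so this example assumes the commonly described naive-Bayes combination, in which the product of the word probabilities is weighed against the product of their complements. The function name `spam_probability`, the default of 0.4 for never-seen words, and the sample word table are hypothetical.

```python
import math

def spam_probability(message, word_probs, top_n=15, unknown=0.4):
    """Combine the top_n most 'interesting' word scores into one spam probability.

    Assumes every value in word_probs lies strictly between 0 and 1
    (for example, clamped to 0.01..0.99) so the products never both hit zero.
    """
    words = set(message.lower().split())
    scores = [word_probs.get(w, unknown) for w in words]
    # "Very guilty or very innocent": keep the scores farthest from a neutral 0.5.
    scores.sort(key=lambda p: abs(p - 0.5), reverse=True)
    top = scores[:top_n]
    spam_product = math.prod(top)
    ham_product = math.prod(1.0 - p for p in top)
    return spam_product / (spam_product + ham_product)

# Hypothetical, already-clamped word table for illustration.
table = {"v1agra": 0.99, "the": 0.5, "rfp": 0.01}
guilty = spam_probability("buy v1agra the", table)
innocent = spam_probability("the rfp", table)
```

Here the message containing "v1agra" scores close to 1, while the message containing "rfp" scores close to 0. Neutral words such as "the" contribute a factor of 0.5 to both products and so cancel out.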
"If spammers say V1agra . . . my God is that guilty - even more than Viagra. Conceivably, people might be writing each other e-mails about Viagra, but they are not going to be writing about V-1-a-g-r-a," Graham says.
In most products, Bayesian filtering is only one of several tests in a heuristics process that determines the e-mail's overall spam probability. Once that's done, the anti-spam tool embeds a spam rating in the message header and then typically sends the e-mail on to the client's e-mail software, which uses the tag to sort and/or delete the message, per user instructions.
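The tag-then-sort handoff might look like the sketch below, which uses Python's standard `email` package. The header name `X-Spam-Score`, the 0.9 threshold, and the folder names are assumptions for illustration; real products use their own header names and formats.

```python
from email.message import EmailMessage

SCORE_HEADER = "X-Spam-Score"  # hypothetical header name

def tag_message(msg, probability):
    """Gateway side: embed the spam rating in the message header."""
    del msg[SCORE_HEADER]  # drop any earlier rating; no error if absent
    msg[SCORE_HEADER] = f"{probability:.2f}"
    return msg

def choose_folder(msg, threshold=0.9):
    """Client side: sort on the tag according to the user's chosen threshold."""
    score = float(msg.get(SCORE_HEADER, "0"))
    return "Spam" if score >= threshold else "Inbox"

msg = EmailMessage()
msg["Subject"] = "Limited-time offer"
msg.set_content("Act now.")
tag_message(msg, 0.97)
folder = choose_folder(msg)
```

Keeping the rating in a header lets each user pick a different threshold, or choose between deleting and filing, without the gateway needing to know those preferences.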
Should the filter err, calling a legitimate e-mail spam or questionable (the false positive) or tagging spam as legitimate (the false negative), the end user moves the falsely labeled message to the correct folder. The filter uses these folders to retrain itself daily, or at a user-specified frequency. Regular training ensures that the filter automatically learns the latest spammer tricks (such as garbage characters in the subject line and spaces between letters). Filtering is also personalized. The banker can accept mortgage offers from the competition as legitimate - while the office manager deletes them as spam.
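The correction loop can be illustrated with a small per-user store of word counts. Note that this sketch updates counts incrementally when a message is moved, which is a swapped-in simplification of the periodic folder retraining described above. The class name `Piles` and its methods are hypothetical.

```python
from collections import Counter

class Piles:
    """Per-user word counts for the spam and not-spam ('ham') piles."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.totals = {"spam": 0, "ham": 0}

    def learn(self, text, label):
        # Count each word once per message, as in the training sketch.
        self.counts[label].update(set(text.lower().split()))
        self.totals[label] += 1

    def unlearn(self, text, label):
        self.counts[label].subtract(set(text.lower().split()))
        self.totals[label] -= 1

    def correct(self, text, wrong_label, right_label):
        """The user moved a mislabeled message to the other folder."""
        self.unlearn(text, wrong_label)
        self.learn(text, right_label)

# One store per user gives the personalization described above.
banker = Piles()
banker.learn("mortgage offer today", "spam")             # filter's first guess
banker.correct("mortgage offer today", "spam", "ham")    # banker says it is legitimate
```

After the correction, "mortgage" counts toward the banker's not-spam pile only, while an office manager's separate store could keep treating the same words as spam.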
Because Bayesian filtering was designed for the client, it is most commonly a feature of consumer products, with a price of about $30 per license. Vendors, no doubt, would be willing to negotiate less-expensive, volume prices for enterprise deployments. Many Bayesian filters also are available for free as open source tools.
While Bayesian filtering's client-side bent is an elegant way to stop spam from sponging up productivity, its drawback is that it introduces major client-management headaches when used on an enterprise scale. That calls for some creativity.