Bill Yerazunis' day job is far from boring. As senior research scientist at Mitsubishi Electric Research Laboratories, the electronics giant's North American R&D arm, Yerazunis has been involved in developing items as diverse as sensors that detect water pollution, a touch-sensitive table for small group collaboration and a self-refilling beer mug. But for Yerazunis, the real fun begins after hours: he has spent the last seven years developing and tweaking CRM114 Discriminator, an open source spam filter that uses statistical probability to determine whether an e-mail is spam. CRM114 is used by individuals, corporations and some ISPs. With that success comes additional corporate responsibility. On April 1, Yerazunis will officially add spam catcher to the many roles he plays at the Cambridge, Mass., lab. But first he'll chair the fifth annual Massachusetts Institute of Technology Spam Conference, scheduled for March 30. Yerazunis recently spoke with Network World's Senior Editor Cara Garretson about his personal spam crusade.
How did you get involved in fighting spam?
I was frustrated by it, so years ago I said to my manager, 'We ought to do something about spam,' and he said, 'Don't worry about it, Bill, spam will never be a problem.' I asked him if I could work on it on my own time and he said, 'I can't stop you.' He's still around. It's like the flight-instructor joke: That's one mistake he'll never make again! I was going to work on a reputation-based system that said, 'If I've gotten mail from this person before, then it's probably good; if not, then it's probably bad.' Then I said, that won't work well. So I went to a heuristics model. But those act reactively. The results you get with [the open source Apache] SpamAssassin are 90% or 95% accurate, but I wanted more - so I started doing statistical filtering.
Did spam get worse in 2006 and, if so, why?
Yes. The amount of spam has increased over time, but most filters have held up quite well. But in 2006, we started getting [at least twice as much] spam. [Through comparative filter tests,] we know the spammers aren't really evolving their techniques, they're just pumping in more spam and there are more people with bad filters. And spam has become the single driving force in the penny-stock market now. [Stock pump-and-dump spam e-mails try to convince recipients to buy shares of a certain spammer-owned penny stock. When enough recipients buy the stock, the spammer sells the stock at a profit.] There are Web pages that are 'rotisserie' stocks, where they pretend to invest $1,000 on each one that comes in. The Web page operators have lost nearly a quarter of a billion dollars at this point.
What's the outlook for 2007?
I would love to say I've got the magic elixir, but I don't. The good news is for subscribers of very large ISPs, because those ISPs get huge amounts of text to put through their filters. Other people [whether using enterprise or home e-mail] aren't going to see spam go down unless they better train the filters. Those people running without big ISPs or [good filters] are going to give up on e-mail. It's already useless without a filter. It used to be the delivery people provided assurance: Thou shall not lose an e-mail. At ARPANet [the Internet's predecessor], they made sure they could function in the face of a nuclear holocaust - that was the mindset. Now that's gone. Now you've got plausible deniability on e-mail: 'Oh, I never got it. My filter ate it.' It's greased the skids of human interaction because you can send something to someone and they can disregard it.
So how do you train a filter?
Training is actually pretty easy. If you've ever used Yahoo Mail, or [Google's] Gmail or [Mozilla's] Thunderbird, you just click the button labeled 'This is spam,' [or, if available and appropriate, the one labeled] 'No! This is not spam.' The software takes it from there. That's how learning filters get their data. Underneath the hood, some fairly heavy mathematics happens, as the filter recalculates lots of probabilities and statistics. But the users don't see that; all they do is click and the magic happens - the system retrains itself and gets a little smarter each time.
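To make the "click and retrain" idea concrete, here is a minimal sketch of a learning filter in Python. It is an illustration only: it uses plain naive Bayes word counting, which is an assumption made for simplicity and not a description of how CRM114 or any of the mail services named above actually work. The class name `TinyLearningFilter` and its methods are hypothetical.

```python
import math
import re
from collections import Counter


class TinyLearningFilter:
    """Hypothetical naive Bayes filter that updates its counts on each click."""

    def __init__(self):
        # Per-class word frequencies and message totals, updated by train().
        self.word_counts = {"spam": Counter(), "good": Counter()}
        self.message_counts = {"spam": 0, "good": 0}

    @staticmethod
    def tokenize(text):
        # Lowercase the text and split it into simple word-like tokens.
        return re.findall(r"[a-z0-9$']+", text.lower())

    def train(self, text, is_spam):
        # Stands in for the user clicking "This is spam" or "This is not spam".
        label = "spam" if is_spam else "good"
        self.message_counts[label] += 1
        self.word_counts[label].update(self.tokenize(text))

    def spam_probability(self, text):
        total = sum(self.message_counts.values())
        if total == 0:
            return 0.5  # No training data yet, so no opinion either way.
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["good"])
        tokens = self.tokenize(text)
        log_scores = {}
        for label in ("spam", "good"):
            # Class prior with add-one smoothing.
            score = math.log((self.message_counts[label] + 1) / (total + 2))
            n_words = sum(self.word_counts[label].values())
            for word in tokens:
                # Add-one smoothing keeps unseen words from zeroing the score.
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (n_words + len(vocab) + 1))
            log_scores[label] = score
        # Turn the two log scores into a probability without overflow.
        top = max(log_scores.values())
        spam = math.exp(log_scores["spam"] - top)
        good = math.exp(log_scores["good"] - top)
        return spam / (spam + good)


mail_filter = TinyLearningFilter()
mail_filter.train("buy cheap penny stock now", is_spam=True)
mail_filter.train("hot penny stock tip buy now", is_spam=True)
mail_filter.train("meeting notes for the chemistry paper", is_spam=False)
mail_filter.train("lunch tomorrow with the lab team", is_spam=False)

print(mail_filter.spam_probability("buy this penny stock"))
print(mail_filter.spam_probability("chemistry paper for the lab meeting"))
```

Each call to `train` plays the role of one button click: the counts shift slightly, and the next score reflects them. A production filter would need far more training mail and, most likely, richer features than single words.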
How well have antispam vendors kept up with spam?
I'm not happy with what the filter vendors are selling. I had a couple [commercial spam filters] in the lab and had them turned off for me because these filters aren't flexible enough; anything not directly addressed to you is spam. [Some vendors] are making interesting claims: 'We never lose an e-mail.' But that's because they bounce it instead. But other people I know have gotten very good results, and people do make a profit selling the commercially available stuff.
What's your favorite spammer trick?
I liked the 'I am not a spam' spam. Look at it in terms of psychological warfare - to get a message to enemy troops, you state it upfront. You get these five or 10 lines in a message that say 'We have X for sale' - it's an antitrick. The second one makes me think spammers are profiling people. I've gotten spam that my filter nailed - I check my filter every day because I need to know how well it's doing. This one had text in it from something I'm working on, an aspect of molybdenum/vanadium chemistry, and it fooled me. I thought it might have been a chemistry paper intentionally sent to me by a co-worker, so I clicked on the Web payload address. It was a porn site. Humans are 99.5% to 99.9% accurate at discerning spam within just a few seconds of looking at it. Humans are reasonably good filters. Interestingly, the vanadium spam did not fool my spam filter. My filter told me that it was a bad chemistry paper and that I wouldn't like it. Maybe that means my spam filter is a better chemist than I am!
Last year, phishing was the big threat. Is it still, or is something else lurking?
The phishing problem is pretty stable. The 2007 MIT Spam Conference topics are spam, phishing and other cyberfraud, especially the stock pump-and-dump [spam] - it's the new black, the new fashion.