Vendors map the DNA of spam
DNA pattern detection applied to spam
By Michael Osterman, Network World
September 27, 2004 12:09 PM ET
One of the keys to genetic research is analyzing DNA and other substances for patterns. Knowing when and where these
patterns occur helps researchers identify the genetic sequences that matter. Because spam messages also contain
recurring patterns, the same basic techniques can be applied to filtering spam from e-mail.
Several companies use techniques based on pattern recurrence to detect spam. IBM, for example, has developed a technique
using the Chung Kwei algorithm; in a test of the algorithm, 96.6% of spam was correctly identified. Cloudmark uses what
it calls “e-mail genetic mapping,” a somewhat different technique that relies on end-user and administrator feedback to
“learn” what constitutes spam for individuals and for organizations as a whole. Cloudmark claims that its technique has
the potential to capture 100% of spam while generating no false positives. Commtouch takes another approach with its
Recurrent Pattern Detection (RPD) technology, which looks for patterns in spam outbreaks in real time. Independent tests
of Commtouch’s RPD technology found that it captures about 97% of spam while generating almost no false positives.
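To make the general idea of pattern recurrence concrete, here is a minimal Python sketch. It is not IBM's, Cloudmark's, or Commtouch's method; it assumes a much simpler stand-in technique in which a "pattern" is any five-character substring that recurs across known spam and never appears in known legitimate mail. The function names, the sample messages, and the `k` and `min_support` parameters are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def shingles(text, k=5):
    """Return the set of overlapping k-character substrings of a normalized message."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}


def learn_patterns(spam, ham, k=5, min_support=2):
    """Keep substrings seen in at least min_support spam messages and in no ham message."""
    counts = Counter()
    for message in spam:
        counts.update(shingles(message, k))  # each message counts a substring once
    ham_shingles = set()
    for message in ham:
        ham_shingles |= shingles(message, k)
    return {s for s, n in counts.items() if n >= min_support and s not in ham_shingles}


def spam_score(message, patterns, k=5):
    """Fraction of the message's substrings that match a learned spam pattern."""
    found = shingles(message, k)
    if not found:
        return 0.0
    return len(found & patterns) / len(found)


# Made-up training messages, purely for illustration.
spam = [
    "Buy cheap meds online now",
    "Cheap meds online, buy today",
]
ham = [
    "Meeting moved to Monday at noon",
    "Please review the attached budget",
]

patterns = learn_patterns(spam, ham)
print(spam_score("Get cheap meds online", patterns))
print(spam_score("Lunch on Monday?", patterns))
```

A message that reuses the recurring phrase scores high even though its exact wording never appeared in the training set, while an unrelated message scores zero. A production filter would need far larger samples and a carefully chosen threshold.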
Looking for patterns in e-mail is important for two reasons. First, it adds another method to the more traditional
ways of detecting spam, potentially improving the overall effectiveness of a spam-blocking tool that incorporates
multiple detection techniques. Second, and more importantly, pattern detection may make it harder for spammers to
circumvent spam-blocking systems, because patterns are inherent in spam and are hard to disguise.
In short, it’s difficult for spammers to create their messages without recognizable patterns emerging.