Traditional spam-catching systems work by predicting the likelihood that a piece of email is an unsolicited ad. The task of prediction isn’t easy, though, so users still have to deal both with unwanted mail that gets through the filters and with legitimate mail that’s caught and filtered away. That has prompted a few ideas about alternative approaches to the unsolicited email problem, approaches that try to achieve lower false-positive and false-negative rates. Two that caught my eye today are IronPort’s Bonded Sender Program and Habeas’ Sender Warranted Email.

The Bonded Sender Program turns the traditional approach around, aiming to guarantee that a specific piece of mail is not spam. It’s able to do this because companies contract with, and pay, IronPort to list their outgoing mail servers in a database of machines guaranteed not to send spam. Then, when your mail server accepts a piece of mail from a machine, it checks to see if that machine is listed in IronPort’s database, and if it is, the mail flows through any spam filters and into your inbox. This seems like a great way for companies that operate legitimate, double-opt-in email lists to make sure that their sales missives reach the intended audience — it appears to be poison-proof (meaning that spammers can’t fake the system into thinking that they’re legitimate), and at least one of the big spam filter providers, SpamAssassin, is on board.
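
To make the flow concrete, here is a minimal sketch of the check a receiving mail server might perform. Everything in it is an assumption for illustration: the database is modeled as an in-memory set of IP addresses, and the names `BONDED_SENDERS`, `looks_like_spam`, and `route` are hypothetical. How a server actually queries IronPort's list isn't covered here.

```python
# Hypothetical stand-in for the bonded-sender database (assumption:
# a plain set of IP addresses; the real lookup mechanism may differ).
BONDED_SENDERS = {"192.0.2.10", "192.0.2.11"}


def looks_like_spam(message: str) -> bool:
    """Placeholder for a conventional content-based filter (assumption)."""
    return "free money" in message.lower()


def route(sender_ip: str, message: str) -> str:
    """Decide where a message goes based on the sending machine."""
    # Mail from a bonded sender skips content filtering entirely.
    if sender_ip in BONDED_SENDERS:
        return "inbox"
    # Everyone else still faces the traditional predictive filter.
    return "spam" if looks_like_spam(message) else "inbox"
```

The point of the design is that the decision for listed machines depends only on who sent the mail, not on what it says, so a legitimate sales email can't be caught by an overeager content filter.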

Sender Warranted Email works in another way, and one that I can’t imagine will be able to sustain itself. Senders “warrant” that their email isn’t spam by including a “trademarked, copyrighted” set of headers that they’ve paid for the right to use; it’s these headers that filters look for to decide that the mail is legitimate. Habeas promises to aggressively sue anyone who uses the headers without the right to do so, providing the teeth behind the system. (Wired News wrote about this back in August.) Unfortunately, I envision that almost every piece of unsolicited email will soon include the headers, in an effort to overwhelm Habeas and make the company unable to go after everyone who is circumventing the rules. (You know the signature block that still graces the bottom of most unsolicited emails, claiming to be acceptable under some obscure Senate rule? Same thing.)

Despite the questionable long-term effectiveness of Habeas’ approach, I applaud both companies for coming up with new ways of attacking the problem. With spam making up an estimated one third of email sent daily, someone’s got to tackle this problem before it takes the entire mode of communication down with it.

Comments

I think people are giving up on automated solutions too easily. I’ve seen a lot of articles recently (your post, the Slate article yesterday) that dismiss automated solutions in the first paragraph, with a single sentence: “it’s hard!” Well, that’s why we have bright computer people.

Paul Graham has created a spam filter that learns the difference between spam and non-spam, using Bayesian statistical techniques: “we now miss less than 5 per 1000 spams, with 0 false positives.” And this is using a naive Bayesian technique, assuming that words appear independently. Allowing probabilities to be chained together (reverse Markov chaining, basically) would yield even better results, although at a speed cost.
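
As a rough sketch of what "naive" means here: if each word has its own estimated probability of indicating spam, and the words are treated as independent with equal prior odds, the per-word probabilities can be combined as below. The function name `combine` and the clamping bounds are illustrative assumptions, not a description of any particular filter's code.

```python
from math import prod


def combine(word_probs, low=0.01, high=0.99):
    """Combine per-word spam probabilities into one score.

    Assumes words are independent and spam/non-spam are equally likely
    a priori. Clamping to [low, high] is an illustrative choice that
    keeps a single word from forcing the score to exactly 0 or 1.
    """
    clamped = [min(max(p, low), high) for p in word_probs]
    spam = prod(clamped)                      # evidence for spam
    good = prod(1.0 - p for p in clamped)     # evidence for non-spam
    return spam / (spam + good)
```

Two words that each look 90% spammy combine to a score of about 0.988, while one spammy word and one equally innocent word cancel out to 0.5.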

• Posted by: Lukas Bergstrom on Nov 21, 2002, 12:48 PM

Most spam I get is from 3rd cousins sending me “humor” — any bonding to fight that?

(btw: I got this trying to post here:

Your comment had the following errors, which you need to correct below:

Name and email address are required.

…and I’m supposed to trust you? [silly smirk])

• Posted by: victor on Nov 21, 2002, 1:41 PM

I agree with you about the coolness of Bayesian filtering, but that being said, I don’t know if it will be a panacea. In order for it to be truly effective on the individual user level, it requires the users to train it about what is and isn’t spam before they start using it. The whole point of Bayesian filtering is its acknowledgement that everyone’s incoming mail is different; words like cancer and prescription are more likely to be in legitimate email to a doctor than they are to a web programmer. If there were to be a more generic Bayesian filter implemented at the consumer level (i.e., one which the programmers have trained with a corpus of generic spam and generic “good” mail), then it’s not really trained for the users who are using it, and the likelihood of false positives and negatives increases.

(Note that this is just my poor-man’s perspective on it; being a statistical solution, the statistics community has also published critiques (PDFs) of the Bayesian approach that offer far more than I can in the way of analysis.)

All that being said, there’s clearly a lot to be offered by a well-implemented Bayesian filter. Since Paul’s article, there has been a flurry of activity on the programming front; here are the options, available for use today, that I have found in about ten minutes of searching: BogoFilter, spambayes, Bayesian Mail Filter, spamcan, Bayespam, SpamSieve, POPFile, and spamfilter. And the happy news for SpamAssassin users is that the soon-to-be-released version 2.50 adds Bayesian filtering. All of this means that the spam-catching method is going to be tested in the real world, which can only be a good thing.

(And, on preview: yeah, Victor, you can trust me. [smirk of my own])

• Posted by: Jason on Nov 21, 2002, 2:25 PM

Excellent summary of Bayesian mail filtering, thanks. I don’t think asking users to train their own spam filters is too much of a barrier. It would be enough to have a ‘Delete and Mark as Spam’ button. If the program then assumes that anything not marked as spam can be placed in the ‘good’ corpus, training the filter should be quick and painless.
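
A minimal sketch of that training loop, under stated assumptions: the class `UserFilter` and its method names are hypothetical, tokenizing is just lowercasing and splitting on whitespace, and a word never seen before is scored as neutral.

```python
from collections import Counter


class UserFilter:
    """Per-user word statistics built from the user's own mail."""

    def __init__(self):
        self.spam_words = Counter()
        self.good_words = Counter()
        self.spam_count = 0
        self.good_count = 0

    def delete_and_mark_as_spam(self, message):
        # The button: count each distinct word once per spam message.
        self.spam_words.update(set(message.lower().split()))
        self.spam_count += 1

    def keep(self, message):
        # Anything not marked as spam lands in the 'good' corpus.
        self.good_words.update(set(message.lower().split()))
        self.good_count += 1

    def word_spam_probability(self, word):
        # Compare how often the word shows up in each corpus.
        s = self.spam_words[word] / max(self.spam_count, 1)
        g = self.good_words[word] / max(self.good_count, 1)
        if s + g == 0:
            return 0.5  # unseen word: no evidence either way (assumption)
        return s / (s + g)
```

Because the counts come from one person's mailbox, a word like "prescription" ends up less damning for a user who receives it in legitimate mail than for one who only ever sees it in spam.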

Can you imagine if Microsoft integrated this into Hotmail? It would be a win for them: people would pay for effective spam-blocking. And it would, in the end, be a win for everyone, because the incentive to spam would go down.

• Posted by: Lukas Bergstrom on Nov 21, 2002, 2:38 PM