I’m glad to note that SpamAssassin 2.50 has been released, bringing Paul Graham’s now-famous Bayesian filtering idea to the world of detecting unsolicited email. The new version also brings improved rules for detecting the telltale signs of bulk mailings, as well as a better way of modifying mail it suspects is spam (it encloses the original mail as an attachment, and then changes the main message to a preview plus an explanation of why it thinks the mail is unwanted). Mostly, I’m happy that the release made it into the world at all, given Deersoft’s acquisition by Network Associates last month. Here’s hoping for continued open source releases…
Feb 25, 2003 | Q
How does the Bayesian stuff work, exactly? Looking at the most recent Readme, it doesn’t really talk about how you train or customize it. I love SA because it’s server-side and works whether I read email from my Mac, from my PC, or from a command line on the road, but I’ve kept away from Bayesian filters because they’re all client-side. If SA has incorporated Bayesian filters, how do I communicate “this is spam” and “this is not spam” to the server with my varied set of mail reading tools? Is it like the whitelisting, requiring that I log in remotely to edit a text file?
• Posted by: mathowie on Feb 25, 2003, 1:55 AM
Matt, the Bayesian stuff works in a bunch of ways, but most importantly, it only works as part of the entire approach. In other words, no mail is classified as spam or non-spam purely on the Bayesian analysis, at least not with the default settings.
First, the default is for SA’s Bayesian system to autolearn spam and non-spam (ham). It does this by feeding any mail that garners a spam score greater than 15 into the Bayesian trainer as spam, and feeding any mail that garners a spam score less than -2 into the trainer as ham. (Of course, as with everything else in SA, you can change these score values as you see fit.) So without doing anything, SA should start to learn what’s spam and what’s not.
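To make the autolearn rule concrete, here is a minimal sketch of the decision as described above. It is not SpamAssassin’s actual code; the function name and constants are made up for illustration, and only the two threshold values (15 and -2) come from the description.

```python
# Illustrative sketch only: names are hypothetical, not SpamAssassin internals.
# The thresholds are the default values described in the comment above.
SPAM_THRESHOLD = 15.0
HAM_THRESHOLD = -2.0


def autolearn_decision(score, spam_threshold=SPAM_THRESHOLD,
                       ham_threshold=HAM_THRESHOLD):
    """Return "spam", "ham", or None for a message's overall spam score."""
    if score > spam_threshold:
        return "spam"   # confident enough to train the filter as spam
    if score < ham_threshold:
        return "ham"    # confident enough to train the filter as ham
    return None         # in between: the message is not fed to the trainer


if __name__ == "__main__":
    for score in (20.0, 4.2, -3.5):
        print(score, autolearn_decision(score))
```

The wide gap between the two thresholds is the point: only mail that the rest of the rules already scored very confidently gets used for training, and everything in the middle is left alone.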
Second, you can train it. The tool that you use is named sa-learn, and the best way to see how it’s used is to either go to the directory with all the SA apps and type perldoc sa-learn or go to this page. It requires you to have mailboxes that contain either pure spam or pure ham, and then trains off of those mailboxes. (It can handle mailboxes that have mail already tagged by SA — it’ll just remove the relevant tags and then process it.) According to that documentation, the best amount to use of each type of mail is between 1000 and 5000 messages. Note that there are a few repositories of spam out there that you can use to train your Bayesian filters, but it’s not the best idea, since you want your filters to understand your spam and non-spam, not someone else’s.
Lastly, there still is the problem of communicating every now and then that something was misclassified. I haven’t yet found a good client-side way to do that, but what I am planning to do is create two IMAP mailboxes — one spam, one ham — into which I can copy misclassified messages. Every so often, then, I’ll have the server process the mailboxes and do the right thing. It’s complicated, but I figure that as the Bayesian stuff gets more mainstream, the client side will catch up.
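A rough sketch of what that periodic job might look like, assuming the two mailboxes end up on the server as mbox files (many IMAP servers store mail differently, so that part is an assumption). The paths and function names are hypothetical, and the `--spam`, `--ham`, and `--mbox` switches are the ones I understand sa-learn to accept; check perldoc sa-learn for your version before relying on them.

```python
import subprocess


def build_sa_learn_commands(spam_mbox, ham_mbox):
    """Build one sa-learn invocation per mailbox (hypothetical helper)."""
    return [
        ["sa-learn", "--spam", "--mbox", spam_mbox],  # train as spam
        ["sa-learn", "--ham", "--mbox", ham_mbox],    # train as ham
    ]


def retrain(spam_mbox, ham_mbox, dry_run=True):
    """Print the commands, or actually run them when dry_run is False."""
    commands = build_sa_learn_commands(spam_mbox, ham_mbox)
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if sa-learn exits non-zero
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Hypothetical paths; substitute wherever your server keeps the mailboxes.
    retrain("/home/me/mail/learn-spam", "/home/me/mail/learn-ham")
```

Run from cron with dry_run=False, something like this would pick up whatever was copied into the two mailboxes since the last pass. Emptying the mailboxes afterward is left out here.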
• Posted by: Jason on Feb 25, 2003, 9:40 AM
Is that similar to the adaptive latent semantic analysis technique that Apple Mail uses? In which case you just click on Spam/Not Spam until it gets it right automatically. Takes a week or two to weed out the odd exception and then forget spam.
• Posted by: John on Feb 25, 2003, 8:15 PM
John, I have no idea what Apple’s Mail.app uses, and from my web research, I don’t think that anyone knows.
• Posted by: Jason on Feb 25, 2003, 10:45 PM
Thanks for the pointer to the new SpamAssassin.
The new Bayesian stuff looks very interesting indeed.
• Posted by: Darren Greaves on Feb 26, 2003, 10:46 AM
I have been using SpamAssassin for about 6 months and it’s great - never had a false positive - but I have had a few spams slip through.
Hopefully the new stuff will help with that.