QDN: Bayes comes to Movable Type

Oct 16, 2003 | Weblogs

Holy crap — James Seng wrote a Bayesian spam filter for comments on Movable Type sites. Similar to the first responses to unsolicited email, the first responses to the comment spam problem were custom blacklists. And just like with e-mail blacklists, I felt a little uneasy about the idea of comment blacklists, because they rely on imperfect data that’s susceptible to sabotage and simple mismanagement. A Bayesian filter should, in theory, be both accurate and less reliant on the continued vigilance of others; if this thing works as advertised, it would be crazy for Six Apart to not slurp this thing up and integrate it into TypePad as an option.

Two comments:

> they rely on imperfect data that’s
> susceptible to sabotage and simple
> mismanagement

Mismanagement, I can understand, but imperfect data and sabotage? How so?

> A Bayesian filter should, in theory, be
> both accurate and less reliant on the
> continued vigilance of others

Comment spam is unlike email spam. They aren’t trying to sell you something with big huge colorful text and offers for 100% improvement in whatever!!!!!!! Most often, the text of the comment looks fairly normal but it contains links that are the problem. How can a Bayesian filter catch that?

Oh, and you still have to train it. And Lord if you train it wrong, you are in trouble…

• Posted by: Jay Allen on Oct 17, 2003, 1:27 AM

Outright blacklist gives very high false positive. Bayesian is both a form of blacklist and whitelist with a “fuzzy” logic behind it, which you have to train.

The problem is that a Bayesian filter is only as good as its training. But the same goes for a blacklist, which is only as good as the list you provide it. Worse, you irrevocably ban others too.

MT-Bayesian is a modified Bayesian algorithm to handle the typical short comments and trackbacks.

• Posted by: James Seng on Oct 17, 2003, 1:48 AM

“text of the comment looks fairly normal but it contains links that are the problem. How can a Bayesian filter catch that?”

And yes, MT-Bayesian specifically is designed to handle that.

• Posted by: James Seng on Oct 17, 2003, 1:59 AM
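One plausible way for a statistical filter to notice links, sketched here in Python, is to turn each URL's hostname into its own token, so that a comment with harmless-looking text still contributes the linked domain as evidence. This is an assumption for illustration and is not taken from MT-Bayesian's source; the function name `tokenize_comment`, the `url:` token prefix, and the example hostname are all hypothetical.

```python
import re
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)
WORD_RE = re.compile(r"[a-z0-9']+")

def tokenize_comment(text):
    """Split a comment into one 'url:<host>' token per link, plus word tokens."""
    tokens = []
    for url in URL_RE.findall(text):
        host = urlparse(url).hostname
        if host:
            tokens.append("url:" + host.lower())
    # Strip the URLs before extracting ordinary words so they are not counted twice.
    plain = URL_RE.sub(" ", text)
    tokens.extend(WORD_RE.findall(plain.lower()))
    return tokens

# Hypothetical comment: friendly text, spammy link.
print(tokenize_comment("Nice post! See http://Spam.Example.com/page"))
```

With tokens like these, a filter trained on a few spam comments can learn that a particular linked host is strong evidence even when the surrounding words look ordinary.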

Jay:

Re: sabotage, follow the “just like with email blacklists” linked text in my post. Essentially, when the Osirusoft DNS-based email blacklist was subjected to a denial-of-service attack, the owner decided to screw the ‘net by changing it to report all mail hosts as spam havens. It was weeks before most mail systems were modified to deal with his sabotage.

Re: imperfect data, there are two problems. First, the presence of a machine in a shared blacklist is generally based solely on one person’s decision to put it there. Why did he put it there? Nobody knows; it could have been real, but it also could have been spite, intolerance of a poster, or just a misunderstanding of what someone was saying in a comment. Yes, most comment spam is obvious, but just as it happens with email, I can imagine people heaping comments into the spambox just because they don’t want to deal with them. (For example, would people trust that every machine that would appear in Dave Winer’s blacklist would really be a spammer, or would there be a good chance that a good chunk of them were just the machines of people who rubbed him the wrong way?)

Second, the presence of a machine (or, more accurately, an IP address) in a blacklist implies that the IP address belongs solely to one person. But in this day and age of shared IP addresses, whether via multiuser machines or dynamically assigned addresses, that’s not always true, and as such, the blacklist has the real potential to sweep in people who don’t belong. That’s imperfect data.

• Posted by: Jason on Oct 17, 2003, 7:52 AM
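For readers unfamiliar with DNS-based blacklists: a mail server checks an IPv4 address by reversing its octets, appending the list's zone, and doing a DNS lookup; any answer means "listed." The Python sketch below shows that naming scheme and why a list that answers for every name flags every host. It makes no real DNS queries. The zone `dnsbl.example.org` and both resolver functions are hypothetical stand-ins.

```python
def dnsbl_query_name(ip, zone):
    """Build the DNS name looked up for an IPv4 address in a DNS-based blacklist."""
    octets = ip.split(".")
    if len(octets) != 4 or not all(o.isdigit() and 0 <= int(o) <= 255 for o in octets):
        raise ValueError("not an IPv4 address: %r" % ip)
    return ".".join(reversed(octets)) + "." + zone

def is_listed(ip, zone, resolve):
    """Return True if the resolver returns any address for the query name.

    `resolve` stands in for a real DNS lookup: it takes a name and returns
    an address string, or None when the name does not exist.
    """
    return resolve(dnsbl_query_name(ip, zone)) is not None

# Hypothetical resolvers, used here instead of real DNS traffic.
def normal_resolver(name):
    listed = {"2.0.0.127.dnsbl.example.org": "127.0.0.2"}
    return listed.get(name)

def wildcard_resolver(name):
    # A list that answers for every name effectively marks every host as listed.
    return "127.0.0.2"

print(dnsbl_query_name("192.0.2.10", "dnsbl.example.org"))
print(is_listed("192.0.2.10", "dnsbl.example.org", normal_resolver))
print(is_listed("192.0.2.10", "dnsbl.example.org", wildcard_resolver))
```

Because the mail server trusts whatever the list returns, one operator's change on the DNS side is enough to affect every site that queries it.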

> Outright blacklist gives very high false positive

Only if someone uses regular expressions incorrectly as I did in one of my recent releases. Other than that, there are no false positives. There’s no acceptable comment except maybe this one, that contains the string kinky-granny.pornwww.com. If the user is careless, then there are false positives, but again, then it only affects that user’s website.

Re: sabotage. Jason, someone can only sabotage their OWN blacklist, causing their own site to be hurt. That’s not really a problem…

> For example, would people trust that every machine
> that would appear in Dave Winer’s blacklist would
> really be a spammer, or would there be a good
> chance that a good chunk of them were just the
> machines of people who rubbed him the wrong way?)

Jason, just like in real life, if you’re going to trust someone, you open yourself up to their goodness and maliciousness. In any case, a site owner trusting someone untrustworthy is the site owner’s problem, not mine. MT-Blacklist doesn’t solve stupidity. In fact, no software does.

And as far as IP blacklists, I’ve screamed it from every mountain. They are useless, ridiculous, imperfect, whack-a-mole solutions. Whoever is using IP blacklists needs to learn a little bit more about the internet.

While I will agree that many blacklist implementations and models are flawed, I still haven’t heard a valid criticism specifically of MT-Blacklist’s implementation (other than a couple of bugs which will be ironed out in the next version), but would be happy to hear some and adapt the program as necessary to best serve the needs of the community.

And James, don’t get me wrong. I love that you created MT-Bayesian, and I hope to one day soon be able to open up MT-Blacklist as a general engine for other people’s filters including yours. Users should have at their disposal as many tools as possible and be able to use them all easily and seamlessly. I look forward to trying out MT-Bayesian. I am skeptical that it would be very successful, but I hope that it is. Regardless, I and many others appreciate your efforts.

• Posted by: Jay Allen on Oct 18, 2003, 9:24 AM

For a comparison between Blacklist and Bayesian, see http://www.paulgraham.com/falsepositives.html

Those of us who have been fighting email spam for many years (especially email operators with a lot of email users) have seen what works and what hasn’t.

• Posted by: James Seng on Oct 18, 2003, 10:58 AM

I took a quick look at your MT-Blacklist. The whole logic to determine whether a comment or ping is spam comes down to this:

foreach my $deny (@blacklisted_strings) {
    # Case-insensitive regex match: a hit anywhere in the text denies the comment.
    if ($str =~ m#$deny#i) {
        return $config->{logDenials} ? (1, $deny) : 1;
    }
}

Effectively, so long as a comment has a blacklisted word (even as a substring), it will be banned. Suppose I have “porn” as a blacklisted word: almost every comment on my entry “Should we ban porn?” would be banned.
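The false positive described here is easy to reproduce. The following is a small Python sketch of the same match-anywhere logic, not MT-Blacklist's actual code; `first_blacklist_hit`, the sample comment, and the hostname are made up for illustration. It also shows the other side of the argument: a very specific entry, such as a full hostname, does not trip on ordinary discussion.

```python
import re

def first_blacklist_hit(text, blacklist):
    """Return the first blacklist pattern found anywhere in text, else None."""
    for pattern in blacklist:
        # Each entry is treated as a case-insensitive regular expression.
        if re.search(pattern, text, re.IGNORECASE):
            return pattern
    return None

comment = "I don't think we should ban porn; who decides what counts?"

# A short, generic entry matches a legitimate comment on the topic.
print(first_blacklist_hit(comment, ["porn"]))
# A specific entry (an escaped hostname) does not.
print(first_blacklist_hit(comment, [re.escape("spam.example.com")]))
```

So how often this logic misfires depends almost entirely on how specific the listed strings are.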

Hence, the simple-mindedness of blacklist logic is the problem, whether it is an IP blacklist or a content blacklist. Bayesian filtering, at least, is a fuzzy logic that analyzes the content before giving you a probability of spam.

• Posted by: James Seng on Oct 18, 2003, 11:09 AM
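To make "a probability of spam" concrete, here is a minimal Python sketch of a textbook naive Bayes classifier with add-one smoothing. It is neither MT-Bayesian's modified algorithm nor the exact formula in Paul Graham's essays, and the class name, tokens, and training data are invented for illustration.

```python
import math
from collections import Counter

class NaiveBayesSpamFilter:
    def __init__(self):
        self.token_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, tokens, label):
        """Record one hand-labeled comment ('spam' or 'ham')."""
        self.token_counts[label].update(tokens)
        self.doc_counts[label] += 1

    def spam_probability(self, tokens):
        """Return P(spam | tokens) under the naive independence assumption."""
        vocab = set(self.token_counts["spam"]) | set(self.token_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in ("spam", "ham"):
            # Smoothed log prior, so an untrained class never gets probability zero.
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            total_tokens = sum(self.token_counts[label].values())
            for tok in tokens:
                count = self.token_counts[label][tok]
                # Add-one smoothing, with one extra slot for unseen tokens.
                score += math.log((count + 1) / (total_tokens + len(vocab) + 1))
            log_scores[label] = score
        # Normalize in log space to avoid underflow on long comments.
        top = max(log_scores.values())
        weights = {k: math.exp(v - top) for k, v in log_scores.items()}
        return weights["spam"] / (weights["spam"] + weights["ham"])

f = NaiveBayesSpamFilter()
f.train(["cheap", "pills", "url:spam.example.com"], "spam")
f.train(["buy", "cheap", "url:spam.example.com"], "spam")
f.train(["should", "we", "ban", "porn"], "ham")
f.train(["great", "post", "thanks"], "ham")

print(f.spam_probability(["cheap", "url:spam.example.com"]))  # above 0.5
print(f.spam_probability(["should", "we", "ban", "porn"]))    # below 0.5
```

Unlike the match-anywhere blacklist, no single word decides the outcome here: "porn" was seen only in legitimate comments during training, so it pulls the score toward ham. The flip side is the one raised earlier in the thread: the result is only as good as the labeled examples fed to `train`.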

Jay, your implementation is based solely on strings, not IP addresses? That changes things a little, and I agree that it’s for the good. But I also don’t see this as a competition — I, like you, would love to find whatever it takes to just deal with this problem before it really gets going. For me, Bayesian filters have been a godsend on the email front… as part of SpamAssassin, which includes both functionalities. Maybe a combined attempt will ultimately be what works here, too!

• Posted by: Jason on Oct 18, 2003, 11:48 AM

We have apparently moved this silly discussion over here.

Jason, absolutely. I came at this whole thing from a blacklisting standpoint with my first post but like I said, the more tools the better.

Bayesian filtering has indeed been a godsend on email. I don’t know how it will work in this environment, but I hope it kicks ass.

• Posted by: Jay Allen on Oct 18, 2003, 1:48 PM

Today I am pleased (proud, relieved, thankful, etc) to announce the release of MT-Blacklist v1.5. There are some major changes…

• Pinged by JayAllen - The Daily Journey on Oct 29, 2003, 9:52 AM

Please note that comments automatically close after 60 days; the comment spammers love to use the older, rarely-viewed pages to work their magic. If comments are closed and you want to let me know something, feel free to use the contact page!