Preventing comment spam the quick way
8th of February 2006
On my personal website I haven't suffered much from comment spam because it's written from scratch, and comment spammers usually target more generic blog commenting systems like MovableType or Blogger. But I did suffer some, and before I started inventing something smart I thought I'd just quickly stop most of it with some string matching. Believe it or not, it really has worked. I started working on the script in August 2005 and since then it has stopped 2,087 spam comments (about 13 per day). Here's how I did that.
As soon as the spam started coming in I noted which evil terms appeared in the comments, such as "viagrapoker" or "credit-dreams.com", and wrote them down in the code. I also wrote down some combinations like "poker" AND "a href", which only count as spam if found together, not individually. Finally, I wrote down a list of words that don't reject the comment but cause it to be saved unapproved (stored but not published).
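The three kinds of rules above can be sketched roughly like this in Python. This is a minimal illustration, not the code running on the site: the names `REJECT_TERMS`, `REJECT_COMBINATIONS`, `HOLD_TERMS` and `classify_comment` are hypothetical, matching is assumed to be case-insensitive, and the "hold" word is a made-up example since the text doesn't list any.

```python
# Hypothetical rule lists. Only the two reject terms and the one
# combination come from the text; the hold term is an invented example.
REJECT_TERMS = ["viagrapoker", "credit-dreams.com"]
REJECT_COMBINATIONS = [("poker", "a href")]
HOLD_TERMS = ["casino"]  # assumed example of a "save unapproved" word


def classify_comment(text):
    """Return 'reject', 'hold' or 'accept' for a comment body."""
    lowered = text.lower()  # assumption: matching ignores case

    # Any single evil term is enough to reject.
    if any(term in lowered for term in REJECT_TERMS):
        return "reject"

    # A combination only counts if every one of its terms is present.
    for combination in REJECT_COMBINATIONS:
        if all(term in lowered for term in combination):
            return "reject"

    # Suspicious words save the comment without publishing it.
    if any(term in lowered for term in HOLD_TERMS):
        return "hold"

    return "accept"
```

With these lists a comment that merely mentions poker is accepted, while one that mentions poker and also contains a link is rejected.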
I restarted the comment saving code and waited until the next piece of comment spam managed to get through. Every time that happened I amended the list with the new words, and the list slowly grew. Fortunately I had a nice setup with my Zope server that let me amend the list quickly and have it take effect immediately, while still keeping source code backups and so on. Here's what it looks like as of today (NB: this code is copied out of something bigger, so read it rather than run it).
Now the comment spam rate has gone down, and I think there's an important reason why. The spammers who first managed to spam my site got 200ish HTTP codes back, which probably told them it was worth coming back with their evil spamming bots. Now those who get trapped get 500ish HTTP codes back, which probably means they start giving up on my site.
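The status-code switch could be sketched like this in Python. It is only an illustration under assumptions: the function name `respond`, the verdict strings and the messages are hypothetical, the exact code 500 stands in for "500ish", and returning 200 for a held comment is a guess, since the text only says what trapped comments get.

```python
def respond(verdict):
    """Map a filter verdict to an assumed (status, message) pair."""
    if verdict == "reject":
        # A 5xx status makes the submission look like a failure to the bot.
        return 500, "Comment could not be saved."
    if verdict == "hold":
        # Assumption: held comments still look successful to the sender.
        return 200, "Thanks. Your comment is awaiting approval."
    return 200, "Thanks. Your comment has been published."
```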
Even if it's an ugly and perhaps naive way of doing things, it stops a large majority of the comment spam; there's no hiding from that fact. The next thing would be to implement a Bayesian filter that you train to recognize not only spam but also the good stuff. With that in place, sane comments that discuss the same things I'm discussing right here could be added without being rejected.
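To make the idea concrete, here is a minimal sketch of such a filter in Python, assuming a plain naive Bayes classifier over words with add-one smoothing. The class name `NaiveBayesFilter` and its methods are hypothetical, and this is an illustration rather than anything running on this site.

```python
import math
import re
from collections import Counter


class NaiveBayesFilter:
    """Tiny naive Bayes classifier over lowercase words (a sketch)."""

    LABELS = ("spam", "ham")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = {label: 0 for label in self.LABELS}

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text, label):
        """Record one example comment as 'spam' or 'ham' (good stuff)."""
        self.word_counts[label].update(self.tokenize(text))
        self.doc_counts[label] += 1

    def score(self, text, label):
        """Log of the smoothed prior times the smoothed word likelihoods."""
        counts = self.word_counts[label]
        total_words = sum(counts.values())
        vocabulary = set()
        for other in self.LABELS:
            vocabulary.update(self.word_counts[other])
        total_docs = sum(self.doc_counts.values())
        # Add-one smoothing keeps unseen labels and words from scoring zero.
        log_prob = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
        denominator = total_words + len(vocabulary) + 1
        for word in self.tokenize(text):
            log_prob += math.log((counts[word] + 1) / denominator)
        return log_prob

    def is_spam(self, text):
        return self.score(text, "spam") > self.score(text, "ham")
```

You would call `train(text, "spam")` on trapped comments and `train(text, "ham")` on approved ones, then ask `is_spam(text)` for new submissions. Because words are weighed against both piles, a comment that mentions "poker" while otherwise reading like an honest discussion can score as good, which plain string matching can't do.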
UPDATE: I did a recount of how many spam comments this solution has trapped: 17,000 in 10 months (about 60 comments/day)