It works! – Geek Drivel

As reported yesterday, I’m now using ThunderBayes/SpamBayes to filter spam. I manually classified several hundred recent spam messages, and a roughly equal number of recent personal (“ham”) messages. So far, it hasn’t misclassified a single message in either direction (no ham marked as spam, and no spam marked as ham), and most of the “unsure” classifications have been legitimate commercial messages, which do resemble spam in many respects.

I turned on the “evidence” option in SpamBayes, so that when I look at the headers for a message I can see which clues it used to make its determination. It’s interesting: it quickly picked up the usually reliable spam words (“only”, “longer”, “erections”, and “embarrassed”, for instance), but some of the others it’s coming up with are surprising. “government” gets a 0.97 spam probability; it apparently showed up in six of the spam messages I trained it on, and none of the hams. “charset:windows-1252” gets a 0.91 (63 spams, 6 hams), probably because most mail from my legitimate acquaintances is written either on Linux or in alternative mail programs under Windows. It also picked up on one of my e-mail addresses, since a good portion of my spam comes in on that address.
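For the curious, here is a rough sketch of how per-word scores like these can come out of the raw counts. It uses Gary Robinson's smoothed estimate, which as far as I know is what SpamBayes applies to each token. The function name is mine, the 300/300 training totals are a stand-in for "several hundred of each", and the `s=0.45` and `x=0.5` values are what I believe the SpamBayes defaults (`unknown_word_strength` and `unknown_word_prob`) to be, so treat all of those as assumptions rather than the tool's actual code.

```python
def token_spam_prob(spam_count, ham_count, n_spam, n_ham, s=0.45, x=0.5):
    """Robinson-style smoothed spam probability for one token.

    spam_count / ham_count: trained messages of each kind containing the token.
    n_spam / n_ham: total trained messages of each kind.
    s, x: strength and value of the prior for unseen words
          (assumed SpamBayes defaults, not confirmed by the post).
    """
    n = spam_count + ham_count
    if n == 0:
        return x  # never seen: fall back to the prior

    # Compare rates, not raw counts, so unequal training sets don't skew it.
    spam_ratio = spam_count / n_spam
    ham_ratio = ham_count / n_ham
    p = spam_ratio / (spam_ratio + ham_ratio)

    # Pull the raw estimate toward x; the pull weakens as evidence (n) grows.
    return (s * x + n * p) / (s + n)


# Hypothetical totals: the post only says "several hundred" of each kind.
N_SPAM = N_HAM = 300

print(round(token_spam_prob(6, 0, N_SPAM, N_HAM), 2))   # "government"
print(round(token_spam_prob(63, 6, N_SPAM, N_HAM), 2))  # "charset:windows-1252"
```

With equal-sized training sets the totals cancel out, and under these assumptions the two calls round to 0.97 and 0.91, matching what the evidence header showed. The smoothing is also why six spams and zero hams doesn't produce a flat 1.0: six sightings isn't enough evidence to be completely sure.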

Lots of fun all around. 🙂