Archive for February, 2004

Feb 26 2004

Training Pair’s Bayesian Filter

Published by Ian Davis under Personal

My spam has risen to unbearable levels. I filter on both the client side, with the BayesIt plugin for The Bat!, and the server side, with Pair’s SpamAssassin installation. I’m fed up with having to download spam though. I don’t want to see it, I don’t want to even think about it. So my goal is to make the server-side filter better and, as with any Bayesian filter, it can be trained.

On the client side it’s easy – just hit a key and BayesIt analyses the message and improves the spam filter. On the server side though there’s a disconnect since I can’t hit a button to tell the server what is spam and what isn’t. After perusing Pair’s support pages I came across a document describing how to train the Bayes filter. It turns out that you can use a mailbox file to train the filter. So I created a spam-collecting email address, spam@internetalchemy.org. Then I set up a filter in The Bat! that redirects the selected mail to spam@internetalchemy.org and deletes it from my inbox. The Bat! has a neat redirect facility that resends a message to another address while preserving the original sender info. This means the email ends up in the spam@internetalchemy.org inbox almost unchanged from how it appeared in mine. Hopefully this will prevent the Bayesian filter from learning that forwarded messages from me are spam! I bound this to a hotkey so I can scan my inbox and redirect spam to the collecting mailbox. The final piece of the puzzle was a cron job to use the contents of the spam@internetalchemy.org inbox to train the filter and then empty the mailbox.
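A minimal sketch of what such a cron job might run, in Python. It assumes SpamAssassin's sa-learn tool is on the path and that the collecting mailbox is a single mbox file. Pair's own document may prescribe a different command, and the function names here are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_train_command(mbox_path):
    """Return an sa-learn call that learns every message in an mbox file as spam."""
    return ["sa-learn", "--spam", "--mbox", str(mbox_path)]


def train_and_empty(mbox_path, runner=subprocess.run):
    """Train on the collected spam, then truncate the mailbox.

    Returns True if training ran, False if the mailbox was missing or empty.
    """
    path = Path(mbox_path)
    if not path.exists() or path.stat().st_size == 0:
        return False
    # check=True raises CalledProcessError if sa-learn fails, so the
    # mailbox is only emptied after a successful training run.
    runner(build_train_command(path), check=True)
    # Assumption: no mail is delivered between training and truncation;
    # a production job would lock the mailbox first.
    path.write_text("")
    return True
```

Cron would call train_and_empty with the path to the collecting mailbox, which depends on how the host lays out mail storage.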

So, at the press of a key I can train the server-side spam filter. Additionally, the more I mention spam@internetalchemy.org on the web, the more the spammers will send me juicy training material. Yum yum.


Feb 13 2004

It’s The Graphs

Published by Ian Davis under Uncategorized

Stefano Mazzocchi gets it:

For years, I thought that XML was the king and RDF was its knight. Well, I got it all wrong: it’s the other way around, it’s just that it’s very hard to realize it.

People in all sort of communities realize how important semi-structured data is and how much it will be in the future. These people tend to think that XML will solve the problem for them and once we have a serious XML query language, we’ll be set forever.

Well, wrong. Relations are tables, XML documents are trees, and, guess what, RDF models are graphs. Yep, you got it, RDF can describe both.

The more I discover RDF and RDF query languages, the more it seems to me that it was all about that stupid RDF/XML syntax that prevented people from getting what RDF really was. RDF is a model for describing labelled directed graphs. That’s it. You can add typing (RDFSchema) or inference (OWL), but the real deal is that you now have a way to describe graphs.

Many people tend to think at RDF as a uselessly complex way to write markup more formally. Some people tend to think that if XML is data, RDF is metadata. Wrong and wrong. An RDF model is a graph. Period. And since all trees and tables are graphs, you can have an RDF representation of any kind of data you want.
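To make the "trees and tables are graphs" point concrete, here is a rough Python sketch that flattens a table and a tree into subject-predicate-object triples, the labelled directed edges RDF is built from. It is an illustration only: real RDF identifies nodes and edge labels with URIs, the plain strings below are placeholders, and a nested dict is a simplification of an XML tree (no ordering, no repeated children). The function names are hypothetical.

```python
def table_to_triples(table_name, rows, key):
    """Each row becomes a node; each other column becomes a labelled edge to its value."""
    triples = []
    for row in rows:
        subject = f"{table_name}/{row[key]}"
        for column, value in row.items():
            if column != key:
                triples.append((subject, column, value))
    return triples


def tree_to_triples(node, path="root"):
    """Each parent-child link in a nested dict becomes an edge; leaves become values."""
    triples = []
    for label, child in node.items():
        if isinstance(child, dict):
            child_id = f"{path}/{label}"
            triples.append((path, label, child_id))
            triples.extend(tree_to_triples(child, child_id))
        else:
            triples.append((path, label, child))
    return triples
```

Both functions return the same kind of thing, a list of edges, so output from a table and output from a tree can be concatenated into one graph.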


Feb 06 2004

Opera Presentations

Published by Ian Davis under Uncategorized

Another neat feature from Opera: OperaShow. Like most web browsers, Opera has a full-screen mode which removes all the menus and toolbars to maximise screen space. What makes Opera special is that in full-screen mode it checks the page’s CSS for rules targeting the projection media type (@media projection) and uses them to override the screen rules. With some judicious use of page-break-after you can convert an HTML document into a presentation with screen-by-screen paging. What an amazingly simple idea!
