Fighting WikiSpam: Eaton and Shared Blacklists

WikiSym 2005 was awesome. Massive props to Dirk Riehle and the program committee for throwing an outstanding event and drawing tons of great, great people. With Wikimania last August and WikiSym this past week, the Wiki community is really starting to gel. And it’s about time. Can you believe Wikis are 10 years old?    (JXD)

Now the bad news: I walked away with some action items. How do I get myself into these messes?!    (JXE)

The first action item can be traced back to an ad hoc meeting that happened at Wikimania regarding WikiSpam. On August 6, a group of Wiki developers — me (PurpleWiki), Alex Schroeder (OddMuse), Brion Vibber (Mediawiki), Thomas Waldmann (MoinMoin), Sven Dowideit (TWiki), Janne Jalkanen (JSPWiki) — along with John Breslin and Jochen Topf, got together to discuss ways we could collaborate on fighting WikiSpam. Our goal was to identify the simplest possible first step and not to get mired in process discussions.    (JXF)

Since all of us were already maintaining URL blacklists, we decided to merge them and host the combined list as a Sourceforge project. We agreed on a standard format (which I’ll document and post soon), and we agreed to send our respective lists to Alex, who already has scripts to slice, dice, and merge.    (JXG)
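As a rough illustration of the merge step (this is not Alex's actual script, and the shared format has not been published yet), here is a minimal Python sketch. It assumes each list is plain text with one regular expression per line and `#` starting a comment; that format, the function name `merge_blacklists`, and the sample patterns are all hypothetical.

```python
import re


def merge_blacklists(*lists):
    """Merge blacklist texts into one sorted list of unique, valid patterns.

    Assumed format (hypothetical): one regular expression per line,
    with '#' starting a comment.
    """
    merged = set()
    for text in lists:
        for line in text.splitlines():
            pattern = line.split("#", 1)[0].strip()  # drop comments and padding
            if not pattern:
                continue  # blank or comment-only line
            try:
                re.compile(pattern)  # skip entries that are not valid regexes
            except re.error:
                continue
            merged.add(pattern)
    return sorted(merged)


# Two made-up lists standing in for what each Wiki engine maintains.
oddmuse = "casino-online\\.example\n# a comment\ncheap-pills\\.example\n"
purplewiki = "cheap-pills\\.example\n\n(broken\n"

merged = merge_blacklists(oddmuse, purplewiki)
```

Duplicates collapse because the patterns are collected in a set, and the malformed `(broken` entry is dropped rather than poisoning the shared list.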

One of my action items then was to create the Sourceforge project. I did that immediately, but for some reason, the project was rejected. Thus began a month-long go-around with Sourceforge support where I tried to discover why they had rejected the proposal. In the end, the project was approved, and I never got an answer as to why it was rejected in the first place. At that point, I was mired in other work, and so I never followed up.    (JXH)

WikiSym was the kick in the butt I needed to follow up. On Sunday, Sunir Shah hosted an antispam workshop, which about 40 people attended. First, Sunir reviewed techniques (many of which are listed at MeatBall:WikiSpam). Then we broke out.    (JXI)

In my breakout, I described what we had agreed on at Wikimania. Then Peter Kaminski described a very cute idea he had for making it easy to fight WikiSpam. In a nutshell, Peter suggested we write a simple drop-in replacement CGI wrapper that would filter a POST payload for spam and call the real CGI script — be it a Wiki, a blog, or anything else — if the payload were spam-free. Such a wrapper would enable users to install spam-protection for any CGI script without having to write a single line of code and without having to do any complex configuration. It wouldn’t require any special access to your web server, since it would just be a CGI script. And you could easily add other spam-fighting measures, such as throttling and IP blacklists.    (JXJ)

I thought it was a brilliant idea. So Peter and I sat down afterwards and whipped it up. Took about an hour. It’s called Eaton, it works, and it’s Public Domain. Peter Kaminski has already blogged about it, and there’s some important commentary there from Jay Allen, the creator of MT-Blacklist.    (JXK)

It’s a proof of concept, and it won’t scale. It can and should be improved, and I’d encourage folks to do so. Nevertheless, it’s pretty cool. Bravo to Peter for a very clever idea.    (JXL)
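To make the wrapper idea concrete, here is a minimal Python sketch of the general technique. It is not Eaton's code. The blacklist entries, the `REAL_CGI` path, and the function names are hypothetical, and it assumes the POST body is `application/x-www-form-urlencoded`.

```python
import os
import re
import subprocess
import sys
from urllib.parse import parse_qsl

# Hypothetical values for illustration; not taken from Eaton.
BLACKLIST = [r"cheap-pills\.example", r"casino-online\.example"]
REAL_CGI = "./wiki.real.cgi"  # the renamed original script


def is_spam(body, patterns=BLACKLIST):
    """Return True if any decoded form value matches a blacklist pattern."""
    fields = parse_qsl(body, keep_blank_values=True)
    return any(
        re.search(pattern, value, re.IGNORECASE)
        for _name, value in fields
        for pattern in patterns
    )


def main():
    # Read exactly the request body the web server handed us.
    length = int(os.environ.get("CONTENT_LENGTH") or 0)
    body = sys.stdin.buffer.read(length)
    is_post = os.environ.get("REQUEST_METHOD") == "POST"
    if is_post and is_spam(body.decode("utf-8", "replace")):
        sys.stdout.buffer.write(
            b"Status: 403 Forbidden\r\n"
            b"Content-Type: text/plain\r\n\r\n"
            b"Rejected as spam.\n"
        )
        return
    # Clean request: replay the untouched body to the real script, which
    # inherits the CGI environment variables from this process.
    result = subprocess.run([REAL_CGI], input=body, stdout=subprocess.PIPE)
    sys.stdout.buffer.write(result.stdout)


# Only act when invoked by a web server as a CGI script.
if os.environ.get("GATEWAY_INTERFACE"):
    main()
```

In this sketch, installing it means renaming the real script and dropping the wrapper in under the old name, which is why no server configuration is needed. Throttling or IP checks would slot in next to the `is_spam` call.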

By the way, the first person to figure out the origins of the name “Eaton” wins a cookie.    (JXM)

Fleischbutter

With WikiSym about to start, I want to close the loop on an obscure Wikimania item I posted last August. I mentioned something about Fleischbutter. Several folks emailed me asking what it was.    (JX8)

It’s exactly what it sounds like, folks. The literal translation is “meat butter.” More information is available here (thanks to Samuel Klein for spotting this). If folks have pictures of this legendary German dish, please post them there.    (JX9)

Speaking of legends, my friend Dave Arnold recently forwarded me a valuable resource regarding the Flying Spaghetti Monster. I encourage you all to spread the word.    (JXA)

Purple Numbers Are Ugly

Evan Henshaw-Plath thinks that Purple Numbers are ugly. He’s not the first.    (JW1)

A bit of history. The original Purple Numbers were a dark purple. Then Murray Altheim came up with a brilliant idea. Let’s make them lighter! So he did. That was better, but it wasn’t enough.    (JW2)

As Chris Dent and I started taking the identifier scheme to the next level, blogs started becoming popular. Permalinks, as rabble and others have pointed out, are granular addresses. Because our new identifier scheme was uglier by design (universally unique IDs can get quite large), we decided to jump on the blogging bandwagon and use hashes instead.    (JW3)

Then a funny thing happened. Peter Yim, a long-time Engelbart follower and an avid PurpleWiki user, complained. A lot. He said it was too hard to figure out what the links were. And as much as I tried to ignore him, I couldn’t. He was right. Or, more accurately, we were wrong. So we changed it back.    (JW4)

Our decision to go with hashes was wrong specifically in the context of PurpleWiki. Wikis are wonderful because you can Link As You Think. This is possible because page names are automatically linked, and it’s easy to remember page names. We added syntax for easily linking to Purple Numbers (for example, Purple Numbers). The problem was that when we replaced the addresses with hashes, we made it harder to Link As You Think.    (JW5)

With WikiSym coming up in a few days, I’ve had Wikis on my mind big-time, and I recently had an epiphany. It turns out I was wrong about being wrong.    (JW6)

The identifiers underlying Purple Numbers are designed to be stable, unique, and meaningless. In other words, not human-friendly. The notion of linking to Purple Numbers via our extended Wiki syntax isn’t tremendously useful, even if the identifiers are easily visible, because they’re impossible to remember. You’re rarely going to Link As You Think, because you’ll most likely have to go to the page to check what the number is. If you’re going to do that, you might as well just cut-and-paste the link, the status quo of the web.    (JW7)

Phil Jones almost stumbled onto something quite profound in his commentary last May, but he couldn’t quite put his finger on it, and Chris and I consequently jumped all over him. We were right, of course, but Phil was onto something.    (JW8)

In a Wiki context, here’s the right way to use Purple Numbers. By default, Purple Numbers should look pretty. So far, the best scheme I’ve seen for this is Simon Willison’s nifty CSS hack. If folks want to link to a Purple Number in a Wiki, they can do it the way they would anywhere else: get the link address by clicking on the number (or hash or paragraph symbol or whatever), and copy-and-paste it into their document.    (JW9)

However, the Wiki should add an additional feature: the ability to add a human-friendly label to any paragraph. Some Wikis implement this capability by using special WikiText tags, but you should be able to implement this using AJAX goodness.    (JWA)

In other words, suppose you’re reading a Wiki page, and you find one paragraph particularly compelling. All you do is click on the paragraph and add your human-friendly tag. Let’s say the page is DeepThoughts and you enter the label indeed. The label should appear next to the paragraph, and anytime somebody wants to link to that paragraph, they just type DeepThoughts#indeed.    (JWB)

Here’s where it gets cute. Doing this feels like tagging. You’re just tagging granular content instead of documents… which is what Purple Numbers are designed to enable in the first place. So, make the label a tag across the entire Wiki. In other words, if you click on the label indeed, you get a search page showing all paragraphs on the Wiki that have been tagged indeed. If you really want to be cute, you can make it a Technorati Tag so it gets crawled.    (JWC)

This makes cosmic sense. Both Wikis and tagging work when labels are not unique — the exact opposite requirement of Purple Numbers. You want namespace clash, and Wikis and tagging give you that. This way, you get the best of all worlds. You still have the immutable, unique Purple Numbers, but now they’re not so ugly. You also get Link As You Think granular addresses and granular tagging. Everyone wins.    (JWD)
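The labeling scheme above can be sketched as a small data structure. This is only an illustration under stated assumptions, not how PurpleWiki works: the class `LabelIndex`, its method names, and the rule that the first paragraph tagged with a label on a page wins for `Page#label` links are all hypothetical.

```python
class LabelIndex:
    """Human-friendly paragraph labels layered on top of unique node IDs."""

    def __init__(self):
        self._by_address = {}  # (page, label) -> node ID
        self._by_label = {}    # label -> [(page, node ID), ...]

    def tag(self, page, node_id, label):
        """Attach a label to the paragraph identified by node_id."""
        # Assumption: within one page, the first paragraph tagged with a
        # label is the one that Page#label resolves to.
        self._by_address.setdefault((page, label), node_id)
        self._by_label.setdefault(label, []).append((page, node_id))

    def resolve(self, address):
        """Turn 'Page#label' into a node ID, or None if nothing matches."""
        page, _, label = address.partition("#")
        return self._by_address.get((page, label))

    def search(self, label):
        """Return every (page, node ID) tagged with label, Wiki-wide."""
        return list(self._by_label.get(label, []))


index = LabelIndex()
index.tag("DeepThoughts", "JWB", "indeed")
index.tag("OtherPage", "JX8", "indeed")

target = index.resolve("DeepThoughts#indeed")
tagged = index.search("indeed")
```

The node IDs stay unique and meaningless, while the labels are free to clash across pages, which is what makes them work as tags.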

(By the way, rabble, looking forward to seeing pretty Purple Numbers in typo!)    (JWE)