Wiki Analytics at the Wikithon

I got to put on my hacker hat for a day (a very rare occurrence for me these days) last Wednesday at the Wikithon. After trolling around for ideas, I decided to work on Wiki Analytics with Matthew O’Connor. We ended up dominating the competition and winning the contest for best hack. (So what if there were only two teams eligible for two prizes?)    (LRI)

http://farm1.static.flickr.com/150/386792217_6d63faa621_m.jpg    (LRJ)

Our driving question was: How can we measure the health of a Wiki? I don’t think there is one best way to use a Wiki, but there might only be three or four. If we can start teasing out patterns of Wiki usage, we can better understand how people collaborate with Wikis, which will help us better facilitate Wiki communities and improve Wiki software. Our goal was to tease out the patterns.    (LRK)

We used data from 266 public Socialtext workspaces and Socialtext’s internal corporate workspace. You can read the details of our brainstorming and work on the Socialtext STOSS Wiki. Our approach was to simplify our tasks so that we could have something to show at the end of the day. It was decidedly practical, but it also reflected a deeper philosophy about Wiki Analytics. Start simple and evolve. You can learn interesting things from even simple measurements.    (LRL)

Results    (LRM)

We chose to focus on two types of analysis: page name and graph (link) analysis. I hacked on the former; Matthew on the latter.    (LRN)

Frequent followers of this blog have heard me say it before: Link As You Think is what makes Wikis powerful. The better your page names, the more interlinked your repository will be as you Link As You Think. In order to see if I could measure “good” page names, I looked at three things:    (LRO)

  • Length    (LRP)
  • Number of tokens (words)    (LRQ)
  • Number of non-alphanumeric characters    (LRR)

The hypotheses are straightforward. Shorter names are better. Names with fewer tokens (words) are better. Names without non-alphanumeric characters are better. (This last hypothesis is complicated by internationalization.)    (LRS)
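As a rough sketch of what these three measurements look like in code (this is not the script I wrote at the Wikithon; the function names are made up for illustration, and whether whitespace should count as non-alphanumeric is an assumption, here it does not):

```python
from statistics import mean


def page_name_metrics(name: str) -> dict:
    """Return the three simple measurements for a single page name."""
    tokens = name.split()
    # str.isalnum() is Unicode-aware, so accented letters and non-Latin
    # scripts count as alphanumeric. Whitespace is skipped (an assumption).
    non_alnum = sum(1 for ch in name if not ch.isalnum() and not ch.isspace())
    return {"length": len(name), "tokens": len(tokens), "non_alnum": non_alnum}


def workspace_summary(names) -> dict:
    """Average each measurement over all page names in one workspace."""
    metrics = [page_name_metrics(n) for n in names]
    return {
        "mean_length": mean(m["length"] for m in metrics),
        "mean_tokens": mean(m["tokens"] for m in metrics),
        "mean_non_alnum": mean(m["non_alnum"] for m in metrics),
    }
```

Calling workspace_summary on the list of page names for a workspace gives per-workspace averages that can be compared across Wikis.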

You can read the results of my analysis. The workspaces on the index page are ordered largest to smallest. The top two workspaces are full of spam and can be safely ignored. The numbers on the index page are buggy; click through to the individual pages to see the correct numbers.    (LRT)

Matthew studied the graph characteristics of the Wikis, specifically:    (LRU)

  • Number of links (forward and back) versus number of pages    (LRV)
  • Number of islands (clusters of pages that are strongly connected to each other) and their sizes (number of pages on an island)    (LRW)

Islands of one are orphan pages (not linked to anywhere) and are undesirable. Large islands are better (or at least more interesting) than small ones.    (LRX)
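To make the island idea concrete, here is a minimal sketch (not Matthew's code). It assumes the Wiki is available as a list of page names plus a list of (source, target) link pairs, and it treats links as undirected, which is a simplification; a strictly directed analysis would look for strongly connected components instead. The function name is hypothetical.

```python
from collections import Counter, deque


def island_sizes(pages, links) -> dict:
    """Return {island size: number of islands of that size}.

    pages: iterable of page names.
    links: iterable of (source, target) pairs.
    Assumption: links are treated as undirected.
    """
    neighbors = {page: set() for page in pages}
    for source, target in links:
        # Ignore self-links and links to pages that do not exist.
        if source in neighbors and target in neighbors and source != target:
            neighbors[source].add(target)
            neighbors[target].add(source)

    seen = set()
    sizes = Counter()
    for start in neighbors:
        if start in seen:
            continue
        # Breadth-first search collects every page reachable from `start`.
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            page = queue.popleft()
            size += 1
            for other in neighbors[page]:
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
        sizes[size] += 1
    return dict(sizes)
```

A page with no links in either direction comes out as an island of size one, which matches the orphan case described above.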

You can view Matthew’s results on his site.    (LRY)

Analysis    (LRZ)

To give you an idea of what the stats mean, let’s look at four Wikis: stoss, speakers, st-rest-docs, and ivrwiki.    (LS0)

The mean number of characters and number of tokens for page names on each Wiki were:    (LS5)

  • 21.3 / 3.1 (stoss)    (LS6)
  • 18.6 / 2.4 (speakers)    (LS7)
  • 17.4 / 1.7 (st-rest-docs)    (LS8)
  • 39.3 / 6.7 (ivrwiki)    (LS9)

On the surface, the two Wikis in the middle, stoss and speakers, seem to have hit the sweet spot for page names: between two and three words per name. Since stoss is meant to be a collaborative workspace for a larger community, this seems to be a healthy number. The speakers Wiki is a repository of potential speakers. Since the majority of pages consist of people’s names, the numbers (two, sometimes three words in a page name) make sense.    (LSA)

The remaining two Wikis diverge enough from this minute data set that we can infer some different patterns of usage. st-rest-docs documents Socialtext’s REST API, so there are a lot of one-word page names representing method names. Even though the average number of tokens is smaller, the average name length is comparable to the two Wikis in the middle. This also makes sense, given that the methods in a REST API are actually URI paths, which can get somewhat long.    (LSB)

On the surface, ivrwiki seems to exhibit the classic signs of a newbie dumping ground, with page names that are too long to be useful. However, if you dig deeper, you can see that that’s not the case. The standard deviation of number of tokens is quite large (4.2), indicating a flat distribution curve. In other words, while there are a lot of long names, there are also a lot of short names. If you dig even further, you’ll see that the community is using the Wiki as a question repository, and questions naturally have lots of words. Additionally, there seems to be a lot of more traditionally “Wiki-like” behavior on that Wiki.    (LSC)

This was no accident. The reason I’m showcasing ivrwiki is that Matthew identified it as an “interesting” Wiki from his graph analysis. Look at the numbers. There are three sizes of islands: 19 of one page, one of 16 pages, and one of 353 pages! That’s one big island! It indicates a fairly tight set of linkages across the majority of the pages on a Wiki. Dig a bit deeper, and you can see the hub of the cluster: the Knowledge Base Index page. It links to every page in the knowledge base, and every page in the knowledge base links back to this page.    (LSD)

The st-rest-docs Wiki exhibits similar behavior — one big island of 81 pages. This makes sense, given that this Wiki represents documentation, which is structured in a similar way to the ivrwiki knowledge base.    (LSE)

The stoss Wiki is the most Wiki-like of the four when you dig into the graph analysis. There are five sizes of islands, the largest containing 10 pages. The distribution is fairly regular — based on my guess of what “regular” should be, at least. To really get a sense of what “regular” should be, we’ll need to identify several Wikis that we consider to be “Wiki-like,” and examine those numbers.    (LSF)

Finally, look at the numbers for the speakers Wiki. The numbers are in reverse of the other Wikis. There is basically no clustering; all of the pages consist of islands in and of themselves. At first glance, this is surprising. You would expect it to look somewhat like ivrwiki and st-rest-docs. The reason for the lack of clustering is that this Wiki relies on Socialtext’s tagging interface for navigation. Tags could be treated as a type of link, but we don’t treat them that way in our analysis.    (LSG)
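The standard deviation argument about ivrwiki can be illustrated with a small sketch. The page names in the usage note are hypothetical, not taken from ivrwiki, and the choice of population standard deviation is an assumption, since the analysis above does not say which variant was used.

```python
from statistics import mean, pstdev


def token_spread(names):
    """Return (mean, population standard deviation) of tokens per page name."""
    counts = [len(name.split()) for name in names]
    return mean(counts), pstdev(counts)
```

For a hypothetical workspace with the names "Home", "FAQ", and "How do I reset my voicemail password", the token counts are 1, 1, and 7. The mean of 3 hides the fact that most names are short; the large spread around that mean is what reveals the mix of short names and long questions.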

Thoughts    (LSH)

As with any simplified analysis, there are always caveats. A lot of them are specific to the Wiki implementation. For example, several people at Socialtext use the stoss Wiki as a blog, which creates long page names and thus skews the statistics. Other Wikis may be similar to the speakers Wiki in that they use tags as navigational links.    (LSI)

There’s an open question as to whether to treat a Wiki as a directed graph. We chose to treat it as directed, but you can make a good argument that the Socialtext Wiki acts as a non-directed graph, or at least a bidirectional one, because Backlinks are displayed on the page itself. The same holds true with any other Wiki depending on the navigational context. If I start at the home page and start navigating around, I can often use the browser back button to go back, or at worst, I can click on “Backlinks” to figure out the context.    (LSJ)

I’m not sure the page name analysis is that interesting by itself. I think it gets very interesting when applied to the specific islands on a Wiki. People may be using a Wiki in a number of different ways, as demonstrated by the ivrwiki. Analysis on each individual cluster will potentially surface the different kinds of behaviors on a Wiki, which is more appropriate than trying to slap on a single archetype if one does not exist.    (LSK)

Finally, what level of clustering is healthy? In systems theory, networks that are either too tightly clustered or too lightly clustered are problematic. With enough analysis, we may be able to speculate on the right number for Wikis.    (LSL)

Matthew and I will release our code at some point, and we’ll hopefully have some time to follow up on it as well. Specifically, I’d like to examine a lot of other Wikis, starting with the ones that Blue Oxen Associates hosts.    (LSM)

There were a lot of other hacks at the Wikithon that were cool. My favorites were Ingy dot Net’s Social Zork (which was not only hilarious, but is actually potentially useful) and Shawn Devlin’s Word Cloud, which I hope to use on other Wikis. Christine Herron wrote a good summary of the day’s festivities.    (LSN)

Socialtext 2.0 Released

Congratulations to Ross Mayfield, Peter Kaminski, Adina Levin, and all the excellent folks at Socialtext for the release of Socialtext 2.0. Even bigger props for slipping in “Purple Consulting” in the screencast. I’ve been cranking so hard over the past six months, I didn’t have a chance to congratulate them on their Open Source release last July, so now I get to combine my commentary here. (In fact, I’m sitting on a bunch of Wiki-related posts right now that I need to push out; a lot of really cool stuff has been happening.) That’s good, because I have plenty to say.    (L72)

Socialtext 2.0 is an important release for three reasons. First, it doesn’t just look good, it’s highly usable. Adina and Pete deserve big-time credit for this. They’ve spent months painstakingly experimenting and testing the design. More importantly, they haven’t just focused on making it easy to use; they’ve also agonized over how to accommodate expert usage.    (L73)

Have they succeeded? I think the personal home base concept is great. I love the fact that Backlinks are visible on the page and get lots of love. I love their new Recent Changes interface (and I hope to see a Tag Cloud view of the all pages index in the next release). I hate the fact that a Recent Changes link is not on every Wiki page. Both Pete and Adina are well aware of this beef, and I’m also well aware of their reason for not including it. Testing and user observation will tell what’s better.    (L74)

Second, Socialtext 2.0 has a really cool REST interface. Chris Dent has been boasting about it for months, but I didn’t look at it myself until Kirsten Jones walked me through it last week. (Her WikiWednesday presentation from earlier this month is online.) It really is cool, and it’s also useful. Congrats to Chris, Kirsten, Matthew O’Connor, and Matt Liggett for their excellent work!    (L75)

What’s great about this API is that it could very well serve as a standard URI scheme for all Wikis. This would obviate the need for a separate SOAP or Atom API. You just have a regular Web app, and you get the API behavior for free.    (L76)

For example, Alex Schroeder is currently going through the same process that Chris went through a year ago with Atom and OddMuse. An easier way around this problem would be to implement these REST APIs.    (L77)

(This is also a great opportunity for me to mention WikiOhana again, which gained great traction at WikiSym last month and which now has a lively Wiki of its own. PBWiki recently announced its own Wiki API, which is a good thing. We are all part of the same Wiki family. Socialtext and PBWiki need to talk about how their two efforts can work together. That’s the WikiOhana Way.)    (L78)

The third important thing about Socialtext 2.0 is that it’s Open Source. (Big props to Jonas Luster and Andy Lester for finally making this happen.) Here’s the thing. I think the announcement a few months back was overblown by a lot of blogosphere hype. The reality of all corporate Open Source releases is that — in and of themselves — they’re mostly meaningless. Mostly, but not completely. The fact that Socialtext 2.0 is Open Source means that other Wiki implementations can benefit from the great work that the Socialtext developers have done, from the APIs to the user interface. That makes for a healthier ecosystem, which is good for everybody.    (L79)

That said, the reason the actual open sourcing of Socialtext 2.0 (and any proprietary software project) is mostly meaningless is that the license is a critical, but tiny part of what makes Open Source software interesting and important. The big part is the community and collaborative process, and a lot of other things besides an open license are required to make that successful.    (L7A)

Before Socialtext went Open Source, I spent many hours talking to a bunch of people there about the impending release. I wanted to know how committed they were to making this a truly open and collaborative software project, because I felt the potential impact on the Wiki community was enormous. The answer I got was complex. The fact that everyone was willing to talk to me with no strings attached, in and of itself, demonstrated a commitment to openness, and I’m still grateful for that. The code itself will be a short-term bottleneck, as it needs a lot of work before outside developers will find it compelling. I also think the licensing terms are weaker than they need to be, although I also understand the outside pressures that make it so.    (L7B)

In short, I think the spirit is strong within Socialtext to fully realize the potential of this Open Source project, but there are also roadblocks. Hopefully, external pressures won’t squash that spirit. If Socialtext ever fulfills its potential as an Open Source company, it will not only help the ecosystem, but it will also tremendously benefit Socialtext as a business.    (L7C)

Queer Numbers

At BAR Camp, I ran into Kragen Sitaker who had an idea for a variant on Purple Numbers called Queer Numbers. Kragen recently blogged the idea (spotted by Matthew O’Connor).    (JWJ)

In brief, Purple Numbers are wonderful, assuming the author has generated them. If the author hasn’t, you can use a proxy, such as PurpleSlurple. The problem with PurpleSlurple is that the addresses aren’t stable. If the author inserts a paragraph into the document, the PurpleSlurple address will point to the wrong place.    (JWK)

Queer Numbers solve this problem by generating stable (maybe) identifiers based on some content analysis. Using this algorithm, you can address granular content on any page and feel fairly confident that the link will go to the right place. The level of confidence is still up in the air, as Kragen notes in his blog post.    (JWL)
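To show why content-derived identifiers survive an inserted paragraph where positional ones do not, here is a deliberately naive sketch. It is not Kragen's algorithm: it simply hashes the normalized text of each paragraph, and the function names and the eight-character length are assumptions made for illustration.

```python
import hashlib
import re


def content_id(paragraph: str, length: int = 8) -> str:
    """Derive a short identifier from a paragraph's own text."""
    # Normalize case and whitespace so trivial reformatting keeps the ID.
    normalized = re.sub(r"\s+", " ", paragraph).strip().lower()
    digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
    return digest[:length]


def address_paragraphs(paragraphs) -> dict:
    """Map each paragraph's identifier to its text."""
    return {content_id(p): p for p in paragraphs}
```

Because each identifier depends only on the paragraph it names, inserting a new paragraph anywhere in the document leaves the existing identifiers alone. The weakness is just as plain: changing a single word in a paragraph changes its identifier, so a practical scheme needs a more forgiving kind of content analysis than an exact hash.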

Kragen referenced some work on lexical signatures for persistent naming of Web pages. (Ironically, Kragen didn’t have the link, and the original link is broken!) That work was Thomas Phelps and Robert Wilensky’s Robust Hyperlinks, and it’s good stuff.    (JWM)

Some additional prior art: Doug Engelbart once told me that his lab had explored the idea of generating granular addresses through a hashing algorithm similar to Kragen’s. (Great minds think alike!) If I recall, their algorithm was less sophisticated than Kragen’s, and I don’t think they got too far with the idea, but I’ll have to double check with Doug to be sure.    (JWN)

About four years ago, I met a fellow named Alon Schwartz through Doug. Alon had founded an Israeli startup called BrowseUp, where he had independently come up with ideas such as granular linking and Backlinks, only to discover that Doug had thought of these ideas a half century earlier. Alon was delighted by this discovery and tried to convince Doug to join forces, but Doug wasn’t interested in getting involved with proprietary software, and BrowseUp eventually suffered the fate of most Dot Coms.    (JWO)

BrowseUp’s product was a proxy server and browser plugin that added granular linking, backlinks, and link types to existing web content. It was pretty cool, and it’s too bad it never got much attention. Alon used a hashing algorithm to generate unique granular addresses that he claimed were over 90 percent stable across different versions of a document. Of course, he wouldn’t tell me what the algorithm was, because the product was proprietary.    (JWP)

I think Kragen’s onto something good, and I hope he’ll turn his idea into code soon so that we can start playing with Queer Numbers in earnest.    (JWQ)

TPVortex: Intro, Call For Help

In my manifesto for collaborative tools, I cited Backlinks as an example of a common, yet oft-overlooked conceptual construct in collaborative tools. Those who know me well know that my strategy for implementing some of Doug Engelbart’s ideas (which I crafted over three years ago) has always been to create simple, concrete tools that could easily be shoehorned into existing applications. The plan was to start with Granular Addressability (Purple Numbers), then move on to Backlinks.    (247)

For a number of reasons, now seems to be the right time for me to start shifting my technical focus to Backlinks. The strategy for doing this is to implement a generic, Open Source Backlink database (dubbed “TPVortex”) and integrate it into several existing tools: PurpleWiki, blosxom, MovableType, MHonArc. I’m looking for folks who might be interested in participating in this project.    (248)

The motivation for such a tool is straightforward: Backlinks provide useful, contextual information. Most Wikis already implement Backlinks. Some of them display Backlinks on the main page, which is the correct behavior. Others (including PurpleWiki) do not. In order to implement this properly, you need a Backlink database.    (249)

Once you have a Backlink database, you might as well use it for other applications besides Wikis, such as blogs. We have this integration in PurpleWiki (see Wikis As Topic Maps for the resulting benefits), but again, it would be much nicer to display the Backlinks on the page itself rather than requiring a person to click on a link to see them. In order to implement this properly, the database has to store document metadata, such as title and author, not just the Backlink. For this reason, I think that TPVortex should use an RDF database on the backend.    (24A)
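As a sketch of the data such a database has to hold, here is a minimal in-memory version. It is not TPVortex and it does not use RDF; a plain dictionary stands in for the backend, and the class and method names are hypothetical. The point is only that each Backlink carries document metadata (title and author), not just a URL.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    url: str
    title: str
    author: str


class BacklinkIndex:
    """Toy Backlink store shared by several tools (Wiki, blog, mail archive)."""

    def __init__(self):
        self._documents = {}                 # url -> Document
        self._backlinks = defaultdict(set)   # target url -> set of source urls

    def add_document(self, doc: Document, outgoing_links) -> None:
        """Record a document and every link it makes to other documents."""
        self._documents[doc.url] = doc
        for target in outgoing_links:
            self._backlinks[target].add(doc.url)

    def backlinks(self, url: str) -> list:
        """Return the documents that link to `url`, with their metadata."""
        sources = self._backlinks.get(url, set())
        return sorted((self._documents[s] for s in sources), key=lambda d: d.url)
```

With something like this behind it, a page template could call backlinks() for the current page and list the title and author of each referring document directly on the page, whether that document is a Wiki page or a blog entry.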

Other thoughts:    (24B)

I welcome help in all forms: comments, critiques, and especially coding. I’ve set up a Wiki page at the Collaboration Collaboratory (Collab:TpVortex) to serve as the center of design discussions. If you’re interested in contributing or commenting, please do it there. Feel free to drop me an email as well.    (24F)

ChiliPLoP, Day 3

Last Thursday, my workshop met for a second day. Having agreed on a working definition for collaboration (see Collab:Collaboration), we started working on the Pattern Language. As was the case the previous day, I knew exactly what I wanted to accomplish, and I made that clear when we got started. What differed this day, however, was that Linda Rising, Ofra Homsky, and Joe Yoder — our three experienced Pattern Language authors — led the way in terms of process.    (1CT)

We began by laying out the index cards we had collected the previous day onto a table. The goal was to see what patterns we had and what seemed to be missing. The definition that we had collectively agreed on the day before helped us tremendously with this process. For example, because collaboration — as we defined it — required bounded goals, that meant there were patterns related to the start and end of the collaborative process. There were also patterns related to interaction (meetings for example) and knowledge exchange (Shared Display).    (1CU)

Mapping out our cards also helped us identify gateways to other Pattern Languages, such as Linda and Mary Lynn Manns’s patterns for introducing new ideas into organizations, Ofra’s patterns for leadership, Jim Coplien and Neil Harrison’s organizational patterns, and GivingSpace’s patterns of uplift.    (1CV)

Lots of brainstorming and storytelling happened throughout. My favorite was a story that Joe Yoder told about a factory where he had previously worked, which literally left its financial books open on the factory floor. Anyone who worked at the company could examine the books and suggest improvements. The open books were a form of Think Out Loud that showed that the company treated its operations as a collaborative process involving all of its employees, regardless of position. Tremendously empowering stuff.    (1CW)

Linda, Ofra, and Joe constantly stressed the importance of iteration and cautioned Josh Rai and me about getting too caught up with formality too early in the process. Ever fearful of being berated by Ralph Johnson or Jim Coplien, I would periodically complain, “That name isn’t a noun phrase!” Fortunately, the rest of the group kept me on track. We had plenty of time to weed out and refine our patterns after the brainstorming process.    (1CX)

We ended our brainstorming at lunch, at which point we had 36 cards. After lunch, we picked two patterns — Collab:StoneSoup and Collab:KickOff — and Linda led us through a group pattern writing exercise. (I’ll say more about these two patterns when I describe Day 4.) She gave us a letter-sized piece of paper for each component of the Coplien Form (name, problem, context, forces, solution, rationale, resulting context, known uses, and related patterns). Each of us took one piece of paper, wrote down our ideas, then exchanged it with someone else for another piece of paper. The cycle continued until we all had our say to our satisfaction. Afterwards, we discussed what we had written.    (1CY)

This was the first time Linda had tried this particular exercise, and I think it worked very well. It was particularly good at helping us reach Shared Understanding. We all had slightly different views of both patterns. Actually going through the group writing process helped make these differences explicit, at which point we were able to talk through our differences.    (1CZ)

Because Josh and I were the pattern-writing newbies in the group, we each collected the sheets for one of the patterns and promised to combine, edit, and rewrite them into a readable draft. I chose Collab:StoneSoup; Josh took Collab:KickOff. The plan for Day 4 (which was only a half day) was to workshop our results.    (1D0)

I ended the day with a brief overview of how blogs and Wikis integrated with Backlinks could be used to tie stories with corresponding patterns.    (1D1)

Chili Beer    (1D2)

Since that night was our last in Carefree, I decided to organize a margarita BOF. Earlier, somebody had told us about the Satisfied Frog, a legendary Mexican restaurant and bar that had “a thousand different kinds of margaritas.” This was the obvious place to hold our BOF, so Josh, Jerry Michalski, Gerry Gleason, and I trekked on over.    (1D3)

As with most legends, the facts had been slightly exaggerated. The Satisfied Frog only served one kind of margarita, although in fairness, it did give us the option of frozen versus on-the-rocks and with or without salt.    (1D4)

The restaurant did, however, brew its own special beer, chili beer, which was bottled with a serrano chili pepper. It had a nice kick to it, but it wasn’t overpowering. I recommend it to those with a penchant for adventure and a bit of a heat tolerance.    (1D5)