Wiki Analytics at the Wikithon

I got to put on my hacker hat for a day (a very rare occurrence for me these days) last Wednesday at the Wikithon. After trolling around for ideas, I decided to work on Wiki Analytics with Matthew O’Connor. We ended up dominating the competition and winning the contest for best hack. (So what if there were only two teams eligible for two prizes?)    (LRI)

http://farm1.static.flickr.com/150/386792217_6d63faa621_m.jpg    (LRJ)

Our driving question was: How can we measure the health of a Wiki? I don’t think there is one best way to use a Wiki, but there might only be three or four. If we can start teasing out patterns of Wiki usage, we can better understand how people collaborate with Wikis, which will help us better facilitate Wiki communities and improve Wiki software. Our goal was to tease out the patterns.    (LRK)

We used data from 266 public Socialtext workspaces and Socialtext‘s internal corporate workspace. You can read the details of our brainstorming and work on the Socialtext STOSS Wiki. Our approach was to simplify our tasks so that we could have something to show at the end of the day. It was decidedly practical, but it also reflected a deeper philosophy about Wiki Analytics. Start simple and evolve. You can learn interesting things from even simple measurements.    (LRL)

Results    (LRM)

We chose to focus on two types of analysis: page name and graph (link) analysis. I hacked on the former; Matthew on the latter.    (LRN)

Frequent followers of this blog have heard me say it before: Link As You Think is what makes Wikis powerful. The better your page names, the more interlinked your repository will be as you Link As You Think. In order to see if I could measure “good” page names, I looked at three things:    (LRO)

  • Length    (LRP)
  • Number of tokens (words)    (LRQ)
  • Number of non-alphanumeric characters    (LRR)

The hypotheses are straightforward. Shorter names are better. Names with fewer tokens (words) are better. Names without non-alphanumeric characters are better. (This last hypothesis is complicated by internationalization.)    (LRS)

You can read the results of my analysis. The workspaces on the index page are ordered largest to smallest. The top two workspaces are full of spam and can be safely ignored. The numbers on the index page are buggy; click through to the individual pages to see the correct numbers.    (LRT)

Matthew studied the graph characteristics of the Wikis, specifically:    (LRU)

  • Number of links (forward and back) versus number of pages    (LRV)
  • Number of islands (clusters of pages that are strongly connected to each other) and their sizes (number of pages on an island)    (LRW)

Islands of one are orphan pages (not linked to anywhere) and are undesirable. Large islands are better (or at least more interesting) than small ones.    (LRX)

You can view Matthew’s results on his site.    (LRY)

Analysis    (LRZ)

To give you an idea of what the stats mean, let’s look at four Wikis:    (LS0)

The mean number of characters and number of tokens for page names on each Wiki were:    (LS5)

  • 21.3 / 3.1 (stoss)    (LS6)
  • 18.6 / 2.4 (speakers)    (LS7)
  • 17.4 / 1.7 (st-rest-docs)    (LS8)
  • 39.3 / 6.7 (ivrwiki)    (LS9)

On the surface, the two Wikis in the middle — stoss and speakers — seem to have hit the sweet spot for page names: between two to three words per name. Since stoss is meant to be a collaborative workspace for a larger community, this seems to be a healthy number. The speakers Wiki is a repository of potential speakers. Since the majority of pages consists of people’s names, the numbers (two, sometimes three words in a page name) make sense.    (LSA)

The remaining two Wikis diverge enough from this minute data set that we can infer some different patterns of usage. st-rest-docs documents Socialtext‘s REST API, so there are a lot of one word page names representing method names. Even though the average number of tokens is smaller, the average name length is comparable to the two Wikis in the middle. This also makes sense, given that the methods in a REST API are actually URI paths, which can get somewhat long.    (LSB)

On the surface, ivrwiki seems to exhibit the classic signs of a newbie dumping ground, with page names that are too long to be useful. However, if you dig deeper, you can see that that’s not the case. The standard deviation of number of tokens is quite large (4.2), indicating a flat distribution curve. In other words, while there are a lot of long names, there are also a lot of short names. If you dig even further, you’ll see that the community is using the Wiki as a question repository, and questions naturally have lots of words. Additionally, there seems to be a lot of more traditionally “Wiki-like” behavior on that Wiki.    (LSC)

This was no accident. The reason I’m showcasing ivrwiki is that Matthew identified it as an “interesting” Wiki from his graph analysis. Look at the numbers. There are three sizes of islands: 19 of one page, one of 16 pages, and one of 353 pages! That’s one big island! It indicates a fairly tight set of linkages across the majority of the pages on a Wiki. Dig a bit deeper, and you can see the hub of the cluster: the Knowledge Base Index page. It links to every page in the knowledge base, and every page in the knowledge base links back to this page.    (LSD)

The st-rest-docs Wiki exhibits similar behavior — one big island of 81 pages. This makes sense, given that this Wiki represents documentation, which is structured in a similar way to the ivrwiki knowledge base.    (LSE)

The stoss Wiki is the most Wiki-like of the four when you dig into the graph analysis. There are five sizes of islands, the largest containing 10 pages. The distribution is fairly regular — based on my guess of what “regular” should be, at least. To really get a sense of what “regular” should be, we’ll need to identify several Wikis that we consider to be “Wiki-like,” and examine those numbers.    (LSF)

Finally, look at the numbers for the speaker Wiki. The numbers are in reverse of the other Wikis. There is basically no clustering; all of the pages consist of islands in and of themselves. At first glance, this is surprising. You would expect it to look somewhat like ivrwiki and st-rest-docs. The reason for the lack of clustering is that this Wiki relies on Socialtext‘s tagging interface for navigation. Tags could be treated as a type of link, but we don’t treat them that way in our analysis.    (LSG)

Thoughts    (LSH)

As with any simplified analysis, there are always caveats. A lot of them are specific to the Wiki implementation. For example, several people at Socialtext use the stoss Wiki as a blog, which creates long page names and thus skews the statistics. Other Wikis may be similar to the speakers Wiki in that they use tags as navigational links.    (LSI)

There’s an open question as to whether or not to consider a Wiki a directed graph or not. We chose the former, but you can make a good argument that the Socialtext Wiki acts as a non-directed graph, or at least a bidirectional one, because Backlinks are displayed on the page itself. The same holds true with any other Wiki depending on the navigational context. If I start at the home page and start navigating around, I can often use the browser back button to go back, or at worst, I can click on “Backlinks” to figure out the context.    (LSJ)

I’m not sure the page name analysis is that interesting by itself. I think it gets very interesting when applied to the specific islands on a Wiki. People may be using a Wiki in a number of different ways, as demonstrated by the ivrwiki. Analysis on each individual cluster will potentially surface the different kinds of behaviors on a Wiki, which is more appropriate than trying to slap on a single archetype if one does not exist.    (LSK)

Finally, what level of clustering is healthy? In systems theory, networks that are either too tightly clustered or too lightly clustered are problematic. With enough analysis, we may be able to speculate on the right number for Wikis.    (LSL)

Matthew and I will release our code at some point, and we’ll hopefully have some time to follow up on it as well. Specifically, I’d like to examine a lot of other Wikis, starting with the ones that Blue Oxen Associates hosts.    (LSM)

There were a lot of other hacks at the Wikithon that were cool. My favorites were Ingy dot Net‘s Social Zork (which was not only hilarious, but is actually potentially useful) and Shawn Devlin‘s Word Cloud, which I hope to use on other Wikis. Christine Herron wrote a good summary of the day’s festivities.    (LSN)

Leave a Reply