Visualizing Wiki Life Cycles

On the first day of WikiSym in Denmark last August, I spotted Alex Schroeder before the workshop began and went over to say hello. Pleasantries naturally evolved into a discussion about Purple Numbers. (Yes, I’ve got problems.) Alex suggested that unique node identifiers were more trouble than they were worth, because in practice, nodes that you wanted to link to were static. Me being me, my response was, “Let’s look at the numbers.” Alex being Alex, he went off and did the measurements right away for Community Wiki, and he did some followup measurements based on further discussions after the conference.    (LSP)

As it turned out, the numbers didn’t tell us anything useful, but our discussions firmly implanted some ideas in my head about Wiki decay rates — the time it takes for information in a Wiki page to stop being useful.    (LSQ)

I had toyed with this concept before. A few years ago, I came up with the idea of changing the background color of a page to correspond to the age of the page. A stale page would be yellowed; an active page would be bright white. I had originally envisioned the color to be based on number of edits. However, I realized this past week that I was mixing up my metaphors. There have been a few studies indicating a strong correlation between frequent edits and content quality, so it makes sense to indicate edit frequencies ambiently. However, just because content has not been edited recently does not mean the information itself is stale. You need to account for how often the page is accessed as well.    (LSR)
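As a rough illustration of the yellowing idea, here is a minimal Python sketch. It is not Kirsten's metric, which isn't documented here. The function names, the 90-day half-life, and the yellow tone are all assumptions made up for the example. The score is driven by whichever of the last edit or last access is more recent, so a page that is still being read does not yellow just because nobody edits it.

```python
from datetime import datetime, timedelta

def staleness(last_edit, last_access, now, half_life_days=90.0):
    """Return a score in [0, 1]: 0 is fresh, 1 is fully stale (hypothetical metric)."""
    most_recent = max(last_edit, last_access)
    age_days = max((now - most_recent).total_seconds() / 86400.0, 0.0)
    # Exponential decay: the score reaches 0.5 after one half-life.
    return 1.0 - 0.5 ** (age_days / half_life_days)

def background_color(score):
    """Blend from bright white toward a yellowed paper tone (assumed colors)."""
    fresh = (255, 255, 255)
    stale = (245, 230, 168)
    channels = [round(f + (s - f) * score) for f, s in zip(fresh, stale)]
    return "#{:02x}{:02x}{:02x}".format(*channels)

if __name__ == "__main__":
    now = datetime(2007, 1, 1)
    edited = now - timedelta(days=400)
    viewed = now - timedelta(days=2)
    # Edited long ago but read recently, so it stays close to white.
    print(background_color(staleness(edited, viewed, now)))
```

A real deployment would want to tune the half-life against actual access logs rather than trust the number picked here.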

(At the Wikithon last week, Kirsten Jones implemented the page coloring idea. She came up with a metric that combined edits and accesses, which she will hopefully document on the Wiki soon! It’s cool, and it should be easy to deploy and study. Ingy dot Net suggested that the page should become moldy, a suggestion I fully endorse.)    (LSS)

This past Sunday, I had brunch with the Socialtext Bloomington Boys. Naturally, pleasantries evolved into Matthew and me continuing along our Wiki Analytics track, this time with help from Shawn Devlin and Matt Liggett. We broke Wiki behavior into a number of different archetypes, then brainstormed ways to visually represent the behavior of each of these types. We came up with this:    (LST)    (LSU)

The x-axis represents time. The blue line is accesses; the green line is edits. Edits are normalized (edits per view) so that, under normal circumstances, the green line will always be below the blue (because users will usually access a page before editing it). The exception is when software is interacting with the Wiki more than people. The whole graph should consist of a representative time-slice in that Wiki’s lifespan.    (LSV)
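The normalization could be sketched like this in Python, assuming accesses and edits have already been counted per time bucket (day, week, whatever the graph uses). The function names and the bucketed-list input format are assumptions for illustration only. The second helper flags the exceptional buckets where edits outnumber views, which is the hint that software rather than people is driving the Wiki.

```python
def edits_per_view(views, edits):
    """Return edits/views for each time bucket; None where there were no views."""
    if len(views) != len(edits):
        raise ValueError("views and edits must cover the same time buckets")
    return [e / v if v else None for v, e in zip(views, edits)]

def software_dominated(views, edits):
    """Return the indices of buckets where edits exceed views."""
    return [i for i, (v, e) in enumerate(zip(views, edits)) if e > v]

if __name__ == "__main__":
    weekly_views = [10, 0, 4]
    weekly_edits = [2, 0, 6]
    print(edits_per_view(weekly_views, weekly_edits))
    print(software_dominated(weekly_views, weekly_edits))
```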

The red line indicates the median “death” rate of Wiki pages. After much haggling, we decided that the way to measure page death was to determine the amount of time it takes for a page to reach some zero-level of accesses. We’ll need to look at actual data to see what the baseline should be and whether this is a useful measurement.    (LSW)
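One way to make the page death measurement concrete is sketched below in Python. It assumes each page comes with a list of daily access counts starting at its creation. The `zero_level` parameter stands in for the baseline that still has to be determined from real data, and pages that are still being accessed on the last observed day are left out of the median. All names here are hypothetical.

```python
from statistics import median

def days_until_death(daily_accesses, zero_level=0):
    """Days from creation until accesses fall to zero_level and stay there.

    Returns None if the page is still above zero_level on the last observed day.
    """
    last_active = None
    for day, count in enumerate(daily_accesses):
        if count > zero_level:
            last_active = day
    if last_active is None:
        return 0  # the page never rose above the zero level
    if last_active == len(daily_accesses) - 1:
        return None  # still alive at the end of the window
    return last_active + 1

def median_death(pages, zero_level=0):
    """Median lifetime over the pages that have died; None if none have."""
    lifetimes = []
    for counts in pages.values():
        days = days_until_death(counts, zero_level)
        if days is not None:
            lifetimes.append(days)
    return median(lifetimes) if lifetimes else None
```

Dropping the still-living pages biases the median toward short-lived pages, so a serious analysis would need to handle them more carefully.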

The red line helps distinguish between archetypes that may have the same access/edit ratio and curve. For example, on the upper left, you see idealized Wiki behavior. The number of edits is close to the number of accesses, both of which are relatively constant across the entire Wiki over time. Because it’s a healthy Wiki, you’ve got a healthy page death rate.    (LSX)

On the upper right, you see a Wiki that is used for process support. A good example of this is a Wiki used to support a software development process. At the beginning of the process, people might be capturing user stories and requirements. Later in the process, they might be capturing bugs. Once a cycle is complete, those pages rapidly become stale as the team creates new pages to support a new cycle. The death line in this case is much shorter than it is for the idealized Wiki.    (LSY)

Again, one use of the Wiki isn’t better than the other. They’re both good in that they’re both augmenting human processes. The purpose of the visualization is to help identify the archetypes so that you can adjust your facilitation practices and tools to best support these behaviors.    (LSZ)

This is all theory at this point. We need to crunch on some real data. I’d love to see others take these ideas and run with them as well.    (LT0)

Wiki Analytics at the Wikithon

I got to put on my hacker hat for a day (a very rare occurrence for me these days) last Wednesday at the Wikithon. After trolling around for ideas, I decided to work on Wiki Analytics with Matthew O’Connor. We ended up dominating the competition and winning the contest for best hack. (So what if there were only two teams eligible for two prizes?)    (LRI)    (LRJ)

Our driving question was: How can we measure the health of a Wiki? I don’t think there is one best way to use a Wiki, but there might only be three or four. If we can start teasing out patterns of Wiki usage, we can better understand how people collaborate with Wikis, which will help us better facilitate Wiki communities and improve Wiki software. Our goal was to tease out the patterns.    (LRK)

We used data from 266 public Socialtext workspaces and Socialtext’s internal corporate workspace. You can read the details of our brainstorming and work on the Socialtext STOSS Wiki. Our approach was to simplify our tasks so that we could have something to show at the end of the day. It was decidedly practical, but it also reflected a deeper philosophy about Wiki Analytics. Start simple and evolve. You can learn interesting things from even simple measurements.    (LRL)

Results    (LRM)

We chose to focus on two types of analysis: page name and graph (link) analysis. I hacked on the former; Matthew on the latter.    (LRN)

Frequent followers of this blog have heard me say it before: Link As You Think is what makes Wikis powerful. The better your page names, the more interlinked your repository will be as you Link As You Think. In order to see if I could measure “good” page names, I looked at three things:    (LRO)

  • Length    (LRP)
  • Number of tokens (words)    (LRQ)
  • Number of non-alphanumeric characters    (LRR)

The hypotheses are straightforward. Shorter names are better. Names with fewer tokens (words) are better. Names without non-alphanumeric characters are better. (This last hypothesis is complicated by internationalization.)    (LRS)
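The three measurements are simple enough to sketch in a few lines of Python. This is not the Wikithon code. It assumes tokens are separated by whitespace and that spaces themselves don't count as non-alphanumeric characters. The function name is made up for the example. Python's `str.isalnum()` is Unicode-aware, which softens the internationalization problem a little, since letters in other scripts are not counted as punctuation.

```python
def name_metrics(page_name):
    """Measure one page name: length, token count, non-alphanumeric count."""
    tokens = page_name.split()
    non_alnum = sum(
        1 for ch in page_name if not ch.isalnum() and not ch.isspace()
    )
    return {
        "length": len(page_name),
        "tokens": len(tokens),
        "non_alnum": non_alnum,
    }

if __name__ == "__main__":
    for name in ["Knowledge Base Index", "What's new?"]:
        print(name, name_metrics(name))
```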

You can read the results of my analysis. The workspaces on the index page are ordered largest to smallest. The top two workspaces are full of spam and can be safely ignored. The numbers on the index page are buggy; click through to the individual pages to see the correct numbers.    (LRT)

Matthew studied the graph characteristics of the Wikis, specifically:    (LRU)

  • Number of links (forward and back) versus number of pages    (LRV)
  • Number of islands (clusters of pages that are strongly connected to each other) and their sizes (number of pages on an island)    (LRW)

Islands of one are orphan pages (not linked to anywhere) and are undesirable. Large islands are better (or at least more interesting) than small ones.    (LRX)
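To show what counting islands might look like, here is a Python sketch that treats an island as a strongly connected component in the graph-theory sense: a set of pages where every page can reach every other by following links. That reading is an assumption, as is the input format (a dict mapping each page to the set of pages it links to). Matthew's actual code may work differently. The sketch uses Kosaraju's two-pass algorithm, written iteratively so that large Wikis don't hit Python's recursion limit.

```python
from collections import Counter

def islands(links):
    """Group pages into strongly connected components (Kosaraju's algorithm)."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    forward = {p: set(links.get(p, ())) for p in pages}
    backward = {p: set() for p in pages}
    for source, targets in forward.items():
        for target in targets:
            backward[target].add(source)

    # Pass 1: record pages in order of depth-first completion.
    order, seen = [], set()
    for start in sorted(pages):
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(sorted(forward[start])))]
        while stack:
            page, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(sorted(forward[nxt]))))
                    break
            else:
                order.append(page)
                stack.pop()

    # Pass 2: walk the reversed links in reverse completion order.
    assigned, result = set(), []
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, stack = set(), [start]
        while stack:
            page = stack.pop()
            component.add(page)
            for prev in backward[page]:
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        result.append(component)
    return result

def island_size_histogram(links):
    """Map island size to the number of islands of that size."""
    return Counter(len(component) for component in islands(links))
```

Under this strict definition, a page that links out but is never linked back also ends up as an island of one, so the singleton count can include more than true orphans.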

You can view Matthew’s results on his site.    (LRY)

Analysis    (LRZ)

To give you an idea of what the stats mean, let’s look at four Wikis:    (LS0)

The mean number of characters and number of tokens for page names on each Wiki were:    (LS5)

  • 21.3 / 3.1 (stoss)    (LS6)
  • 18.6 / 2.4 (speakers)    (LS7)
  • 17.4 / 1.7 (st-rest-docs)    (LS8)
  • 39.3 / 6.7 (ivrwiki)    (LS9)

On the surface, the two Wikis in the middle (stoss and speakers) seem to have hit the sweet spot for page names: between two and three words per name. Since stoss is meant to be a collaborative workspace for a larger community, this seems to be a healthy number. The speakers Wiki is a repository of potential speakers. Since the majority of pages consists of people’s names, the numbers (two, sometimes three words in a page name) make sense.    (LSA)

The remaining two Wikis diverge enough from this minute data set that we can infer some different patterns of usage. st-rest-docs documents Socialtext’s REST API, so there are a lot of one-word page names representing method names. Even though the average number of tokens is smaller, the average name length is comparable to the two Wikis in the middle. This also makes sense, given that the methods in a REST API are actually URI paths, which can get somewhat long.    (LSB)

On the surface, ivrwiki seems to exhibit the classic signs of a newbie dumping ground, with page names that are too long to be useful. However, if you dig deeper, you can see that that’s not the case. The standard deviation of number of tokens is quite large (4.2), indicating a flat distribution curve. In other words, while there are a lot of long names, there are also a lot of short names. If you dig even further, you’ll see that the community is using the Wiki as a question repository, and questions naturally have lots of words. Additionally, there seems to be a lot of more traditionally “Wiki-like” behavior on that Wiki.    (LSC)
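The point about standard deviation can be illustrated with a small Python sketch. The page names below are invented, not taken from ivrwiki, and the choice of population standard deviation (`pstdev`) over the sample version is an assumption, since the analysis doesn't say which was used. A Wiki that mixes one-word index pages with full-sentence questions produces a mean that describes neither group, and the large spread is the signal to dig further.

```python
from statistics import mean, pstdev

def token_summary(page_names):
    """Return (mean, population standard deviation) of tokens per page name."""
    counts = [len(name.split()) for name in page_names]
    return mean(counts), pstdev(counts)

if __name__ == "__main__":
    names = [
        "Home",
        "FAQ",
        "How do I reset my voicemail password",
        "Why does the call drop after the greeting plays",
    ]
    average, spread = token_summary(names)
    print(f"mean={average:.1f} stdev={spread:.1f}")
```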

This was no accident. The reason I’m showcasing ivrwiki is that Matthew identified it as an “interesting” Wiki from his graph analysis. Look at the numbers. There are three sizes of islands: 19 of one page, one of 16 pages, and one of 353 pages! That’s one big island! It indicates a fairly tight set of linkages across the majority of the pages on a Wiki. Dig a bit deeper, and you can see the hub of the cluster: the Knowledge Base Index page. It links to every page in the knowledge base, and every page in the knowledge base links back to this page.    (LSD)

The st-rest-docs Wiki exhibits similar behavior — one big island of 81 pages. This makes sense, given that this Wiki represents documentation, which is structured in a similar way to the ivrwiki knowledge base.    (LSE)

The stoss Wiki is the most Wiki-like of the four when you dig into the graph analysis. There are five sizes of islands, the largest containing 10 pages. The distribution is fairly regular — based on my guess of what “regular” should be, at least. To really get a sense of what “regular” should be, we’ll need to identify several Wikis that we consider to be “Wiki-like,” and examine those numbers.    (LSF)

Finally, look at the numbers for the speakers Wiki. The numbers are the reverse of those for the other Wikis. There is basically no clustering; all of the pages consist of islands in and of themselves. At first glance, this is surprising. You would expect it to look somewhat like ivrwiki and st-rest-docs. The reason for the lack of clustering is that this Wiki relies on Socialtext’s tagging interface for navigation. Tags could be treated as a type of link, but we don’t treat them that way in our analysis.    (LSG)

Thoughts    (LSH)

As with any simplified analysis, there are always caveats. A lot of them are specific to the Wiki implementation. For example, several people at Socialtext use the stoss Wiki as a blog, which creates long page names and thus skews the statistics. Other Wikis may be similar to the speakers Wiki in that they use tags as navigational links.    (LSI)

There’s an open question as to whether to consider a Wiki a directed graph or a non-directed one. We chose the former, but you can make a good argument that the Socialtext Wiki acts as a non-directed graph, or at least a bidirectional one, because Backlinks are displayed on the page itself. The same holds true with any other Wiki depending on the navigational context. If I start at the home page and start navigating around, I can often use the browser back button to go back, or at worst, I can click on “Backlinks” to figure out the context.    (LSJ)
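For comparison, here is a Python sketch of what the island count would look like under the non-directed reading, where a link and its backlink are treated as the same connection. It uses a small union-find structure. The function name and the dict-of-sets input format are assumptions for the example. In the sample data, no page can be reached from the page it links to, so a strict directed analysis would see four separate single-page islands, while the non-directed view merges three of them into one.

```python
def undirected_islands(links):
    """Group pages that are connected when link direction is ignored."""
    parent = {}

    def find(page):
        parent.setdefault(page, page)
        while parent[page] != page:
            parent[page] = parent[parent[page]]  # path halving
            page = parent[page]
        return page

    for source, targets in links.items():
        find(source)  # register pages that have no links at all
        for target in targets:
            parent[find(source)] = find(target)

    groups = {}
    for page in parent:
        groups.setdefault(find(page), set()).add(page)
    return list(groups.values())

if __name__ == "__main__":
    links = {"Home": {"A"}, "A": set(), "C": {"Home"}, "Orphan": set()}
    print(sorted(sorted(group) for group in undirected_islands(links)))
```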

I’m not sure the page name analysis is that interesting by itself. I think it gets very interesting when applied to the specific islands on a Wiki. People may be using a Wiki in a number of different ways, as demonstrated by the ivrwiki. Analysis on each individual cluster will potentially surface the different kinds of behaviors on a Wiki, which is more appropriate than trying to slap on a single archetype if one does not exist.    (LSK)

Finally, what level of clustering is healthy? In systems theory, networks that are either too tightly clustered or too lightly clustered are problematic. With enough analysis, we may be able to speculate on the right number for Wikis.    (LSL)

Matthew and I will release our code at some point, and we’ll hopefully have some time to follow up on it as well. Specifically, I’d like to examine a lot of other Wikis, starting with the ones that Blue Oxen Associates hosts.    (LSM)

There were a lot of other hacks at the Wikithon that were cool. My favorites were Ingy dot Net’s Social Zork (which was not only hilarious, but is actually potentially useful) and Shawn Devlin’s Word Cloud, which I hope to use on other Wikis. Christine Herron wrote a good summary of the day’s festivities.    (LSN)

The Story of Glormf: Lessons on Language and Naming

Jack Park recently asked about Link As You Think on the Blue Oxen Collaboration Collaboratory. I’ve written several blog posts on the matter, but there’s not much else out there. This was a great excuse for me to tell a few vignettes about Shared Language and the importance of names.    (KMO)

Glormf    (KMP)

This is Glormf, courtesy of the uber-talented cartoonist, Brian Narelle.    (KMQ)


Fen Labalme coined the term (originally spelled “glormph”) at an Identity Commons retreat in July 2003. We were strategizing about next steps, and we found that we were all struggling to describe what it was that we were all working on. Although we all had different views of the proverbial elephant, we were also convinced that we were talking about the same thing. In an inspired moment of clarity, Fen exclaimed, “It’s Glormf!” Much to our delight, Brian was listening to the conversation and drew Glormf for all of us to see.    (KMS)

Glormf’s birth lifted a huge burden off our shoulders. Even though Glormf was mucky, it was also real. We knew this, because it had a name and even a picture, and we could point to it and talk about it with ease. The name itself had no biases towards any particular view, which enabled all of us to use it comfortably. Each of us still had a hard time describing exactly what Glormf was, but if anyone challenged Glormf’s existence, any one of us could point to Glormf and say, “There it is.”    (KMT)

We had created Shared Language, although we hadn’t rigorously defined or agreed on what the term meant. And that was okay, because the mere existence of Shared Language allowed us to move the conversation forward.    (KMU)

Ingy’s Rule and Community Marks (KMV)

Ingy dot Net’s first rule of starting a successful Open Source project is to come up with a cool name. I like to say that a startup isn’t real until it has a T-shirt.    (KMW)

Heather Newbold once told a wonderful story about how Matt Gonzalez’s mayoral campaign buttons galvanized the progressive community in San Francisco and almost won him the election. As people started wearing the green campaign buttons, she described the startling revelation that progressives in San Francisco had: There are others out there like me. A lot of them. I was amazed to hear her speak of the impact of this recognition, coming from a city that has traditionally been a hotbed of activism.    (KMX)

There’s a pattern in all of these rules and stories. I struggled to come up with a name for this pattern, and the best I could do for a long time was Stone Soup (courtesy of the participants in my 2004 Chili PLoP workshop). I loved the story associated with this name, the parable of how transformational self-awareness can be. But, it wasn’t quite concrete enough for my taste.    (KMY)

I think Chris Messina’s term, “Community Mark,” is much better. Chris has actually fleshed out the legal implications of a Community Mark, which I recommend that folks read. Whether or not you agree with him on the details, the essence of Community Marks is indisputable: Effective communities have Community Marks. Community Marks make communities real, just as the term “Glormf” made a concept real. That’s the power of Shared Language.    (KMZ)

Pattern Languages and Wikis    (KN0)

Pattern Languages are all about Shared Language. Much of Christopher Alexander’s classic, The Timeless Way of Building, is about the importance of names. In his book, Alexander devotes an entire chapter to describing this objective quality that all great buildings have. As you can imagine, his description is not entirely concrete, but he does manage to give it a name: “Quality Without A Name.” Call it a copout if you’d like, but if you use the term (or its acronym, “QWAN”) with anyone in the Pattern Language community, they will know what you’re talking about. Shared Language.    (KN1)

Ward Cunningham was one of the pioneers who brought Alexander’s work to the software engineering community. He created Wikis as a way for people to author and share patterns. Not surprisingly, an important principle underlying Wikis is the importance of names. Regardless of what you think about WikiWords, they have important affordances in this regard. They encourage you to think of word pairs to describe things, which encourages more precise names. They discourage long phrases, which also encourages precision as well as memorability. The more memorable a term, the more likely people will use it.    (KN2)

Ward often tells a story in his Wiki talks about using Class-Responsibility-Collaboration Cards to do software design. One of the things he noticed was that people would put blank cards somewhere on the table and talk about them as if there was something written there. The card and its placement made the concept real, and so the team could effectively discuss it, even though it didn’t have a name or description. (Ward has since formalized leaving CRC cards blank as long as possible as a best practice.) This observation helped him recognize the need for and importance of Link As You Think, even if the concept (or Wiki page) did not already exist.    (KNG)

Open Source: Propagating Names    (KN3)

One of Blue Oxen’s advisors, Christine Peterson, coined the term, “Open Source.” In February 1998, after Netscape had announced its plans to open source its browser, a few folks (Chris, Eric Raymond, Michael Tiemann, Ka-Ping Yee, and others) gathered at the Foresight Institute to strategize. At the meeting, Todd Anderson complained that the term, “Free Software,” was an impediment to wide-scale adoption. After the meeting, Christine called up Todd and suggested the term, “Open Source.” They both loved it. But, they didn’t know how to sell it.    (KN4)

So, they didn’t. At the followup meeting a few days later, Todd casually used the term without explanation. And others in the room naturally picked up on the term, to the point where they were all using it. At that point, they realized they had a good name, and they started evangelizing it to the rest of the community.    (KN5)

Names change the way we think about concepts, and so propagating names widely can shift the way people think about things. This is what happened with “Open Source.” This is what George Lakoff writes about in Moral Politics.    (KN6)

The mark of a good name is that people naturally start using it. A name can come from the top down, but it can’t generally be forced onto people.    (KN7)

Clean Hacking Stations

Fun Fact About Eugene #322: I’m obsessed with cooking shows. There’s nothing I like better on a Saturday morning than rolling out of bed, turning on PBS, and watching Jacques Pepin work his magic. What’s amazing about these chefs is that they always have a clean cooking station. Always. It’s apparently a principle they teach at cooking school, and it makes a lot of sense. It also seems to apply to other areas of life. Getting Things Done. Project Management. And of course, hacking.    (K95)

Ingy dot Net (The Hacker Formerly Known as Brian Ingerson) was in town this past week, and we hacked a little bit on Wednesday night. “Hacking” with Ingy for me so far has mostly consisted of me watching him in action, catching a typo here or there and occasionally pursuing some philosophical disagreement. But it’s cool, because I enjoy watching other folks code, especially folks who are better than me.    (K96)

(Earlier this month, while working on the Ruby YADIS library with Brian Ellin, I learned for the first time about command completion in Emacs using meta-backslash. Emacs has been my primary programming environment for about 15 years, and yet, I never knew about this. Very embarrassing.)    (K97)

One thing that surprised me about Ingy is that he doesn’t code very fast. On the other hand, one thing he does incredibly well is that he always has a “clean hacking station.” Even when he creates temporary directories or inserts debug statements in his code, he does it in a very clean way. It’s a practice I’d like to do a better job of emulating.    (K98)