Eugene Eric Kim <eekim@eekim.com>
Version 1.0
May 27, 2001
On the surface, adding purple numbers to a Web site is an absurdly simple problem. However, the granular addressability issues underlying this process are not so straightforward. The real question is, what are the best addressing schemes and procedures for making Web pages granularly addressable? 1A (02)
In this paper, I describe my personal experiences adding purple numbers to my web site (http://www.eekim.com/). I explain the basic problem in detail, describe the tools and processes I used to tackle this problem, analyze the strengths and weaknesses of my approach, and discuss their larger implications for the OHS and DKRs. 1B (03)
HTML allows you to create links to any type of document on the World Wide Web. Additionally, you can link to parts of HTML documents using HTML anchors. For example, suppose you have the document shown in Example 1 located at http://foo/bar.html. 2A (05)
<html>
<head><title>bar</title></head>
<body>
<h1>bar</h1>
<p><a name="p1">This is the first paragraph.</a></p>
</body>
</html>
You can create a link to this entire document, or you can link to the first paragraph of this document using the "p1" anchor, as shown in Example 2. 2C (07)
<a href="http://foo/bar.html">http://foo/bar.html</a>
<a href="http://foo/bar.html#p1">first paragraph of bar.html</a>
The ability to create fine-grained links to a document is known as granular addressability, which is an important characteristic of advanced knowledge management systems. Unfortunately, HTML's support of this feature is limited. In order to create granular links to subsets of an HTML document, the author of that document must explicitly create named anchors within that document. If the author does not do this, then you will be unable to link to anything more granular than the entire document. 2E (09)
In order to get around this problem, Doug Engelbart suggested that authors create named anchors for every significant markup element in an HTML document, using the addressing scheme he invented for his groundbreaking Augment hypertext system. In order to visually distinguish these named labels from the content of the HTML document, Engelbart suggested that these anchors be displayed in a purple font. 2F (010)
Augment documents consist of a hierarchical tree of "statements," which are chunks of text. Augment does not keep track of the semantic type of a statement; for example, it does not distinguish a "header" statement from a "paragraph" statement. 3A (012)
Every statement has two built-in addresses -- a statement identifier (SID) and a structural location number. Additionally, the author can attach an optional label to each statement, essentially equivalent to attaching a named anchor to an HTML element. 3B (013)
An SID consists of a positive integer preceded by a zero. Every statement within a document contains a unique, immutable SID. 3C (014)
The structural location "number" is an alphanumerical address that designates the location of that statement within a document. The structural location numbers in the first level of the hierarchy consist of numbers that increment with each new statement on that level. The addresses on the next level consist of the parent address, followed by letters that increment with each new statement on that level. Numbers and letters alternate with each level in the hierarchy. 3D (015)
Example 3 is a depiction of an Augment document with its SID and structural location numbers. Statements are separated by blank lines, and hierarchies are designated by indentations. Each statement is preceded by its structural location address, a colon, its SID, and another colon. 3E (016)
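The addressing scheme and display format described above can be sketched in a few lines of Python. This is only an illustration: the statements and SIDs below are hypothetical, the names `Statement`, `to_letters`, and `render` are invented for this sketch, and the rule for going past "Z" on a lettered level (AA, AB, ...) is an assumption, not something taken from Augment.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Statement:
    sid: str   # a zero followed by a positive integer, e.g. "01"
    text: str
    children: List["Statement"] = field(default_factory=list)


def to_letters(n: int) -> str:
    """Map 1 -> A, 2 -> B, ..., 26 -> Z, 27 -> AA (assumed overflow rule)."""
    letters = ""
    while n > 0:
        n, rem = divmod(n - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


def render(statements, prefix="", depth=0):
    """Return 'location:SID:text' lines, indented two spaces per level."""
    lines = []
    for index, stmt in enumerate(statements, start=1):
        # Levels alternate: numbers at depth 0, 2, ...; letters at depth 1, 3, ...
        part = str(index) if depth % 2 == 0 else to_letters(index)
        address = prefix + part
        lines.append("  " * depth + f"{address}:{stmt.sid}:{stmt.text}")
        lines.extend(render(stmt.children, address, depth + 1))
    return lines


# Hypothetical document; the SIDs reflect creation order, not position.
doc = [
    Statement("01", "Introduction", [
        Statement("02", "Background"),
        Statement("04", "Motivation", [Statement("05", "Detail")]),
    ]),
    Statement("03", "Conclusion"),
]

print("\n\n".join(render(doc)))  # statements separated by blank lines
```

Note that the SIDs stay attached to their statements no matter where the statements sit in the tree, while the structural location numbers are recomputed from position on every rendering.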
Engelbart first used his purple numbers to designate addresses in Augment documents that had been translated into HTML and placed on the Bootstrap Institute web site. The translation was straightforward, but it was done manually. Statements were converted into paragraph, header, and list item elements, depending on their context. HTML named anchors were created for every SID, structural location number, and optional statement label. Finally, structural location numbers were added to the end of each statement in a purple font. 4A (019)
Augment provides the capability of displaying or not displaying any of the three possible addresses of a statement. However, this capability does not exist natively within Web browsers (although it could be simulated using JavaScript). Rather than display all three addresses in the translated document, Engelbart decided to display only the structural location number. 4B (020)
Engelbart's decision was justified for two reasons. First, the papers being translated were published papers that were not likely to change. Thus, links to a structural location number in a published paper would always point to the correct statement. Second, in his previous experience with Augment, people preferred to use the structural location number over SIDs for addressing, because the location numbers mapped logically to the physical layout of the document and were thus easier to remember and understand. An address of "0523" could be pointing to any statement in a document, whereas an address of "3B5" would always point to the fifth substatement of the second substatement of the third statement on the first level of the document. 4C (021)
Later, Engelbart decided to apply the purple numbers to all documents on the Bootstrap Institute's web site, not just the translated Augment documents. Unfortunately, this was done manually. Because it was a tedious and time-consuming process, the Bootstrap Institute web master opted not to include named anchors with SIDs, as structural location numbers were the only addresses that would be visible from the browser anyway. 4D (022)
There were several flaws with this approach, the worst being that it was a manual process. One consequence of the structural location addresses being visible was that every time a new statement was added, statements following that statement on the same level of the hierarchy had to be renumbered. When a new hierarchy was added, every statement below that hierarchy had to be renumbered. Handled automatically, this would be no problem, but handled manually, these requirements made the task far more difficult. 4E (023)
Additionally, structural location numbers were a poor choice of addresses for frequently changing documents. If a statement moved, the structural location number would change, and all links to the original address would either break or point to the wrong item. 4F (024)
As I became more involved with Engelbart's OHS project, I began writing some relevant technical papers. I wanted to publish these papers using purple numbers, and rather than add those numbers manually, I wrote Purple, a system for automatically adding purple numbers to documents. 5A (026)
Because I was using Purple specifically for published papers and not for general Web site use, I decided against writing a system that would automatically modify HTML documents. Instead, I decided to write my papers in an XML format and convert these into HTML. Rather than use the standard, but large DocBook XML vocabulary, I created a very small vocabulary -- defined by purple.dtd -- and wrote Perl scripts for adding addresses to the elements and XSLT scripts for translating the XML documents into HTML. 5B (027)
Purple was first released on January 31, 2001. It added both SIDs and structural location numbers to elements. Because the papers that I was writing had a good chance of changing over time, I decided to display the SID as well as the structural location number in purple, with the former in parentheses following the latter. 5C (028)
On April 24, 2001, Murray Altheim released plink, a program written in Java that had similar functionality to Purple. plink works on XHTML files, although it may be configured to work with other types of XML files. 5D (029)
I was very pleased to see plink, mainly because, as another independent implementation of the purple number concept, plink helped validate this rather simple, but significant idea. I was also able to cull a number of interesting ideas from Altheim, some of which I applied to Purple. 5E (030)
plink's most interesting feature is that it can be configured to work with other XML vocabularies. If all plink did were add SIDs to the proper elements, this would be an easy feature to program. All you would have to do would be to specify the "significant" elements in an XML document, and then have the software add SIDs to those elements. 5F (031)
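A minimal sketch of this "easy" case, written in Python for illustration (it is not plink's or Purple's code). The `SIGNIFICANT` set and the function name `add_sids` are hypothetical; the `sid` attribute name follows purple.dtd. Existing SIDs are left untouched so that they stay immutable.

```python
import xml.etree.ElementTree as ET

# Assumed configuration: which element names count as "significant".
SIGNIFICANT = {"h1", "h2", "p", "li"}


def add_sids(root, significant=SIGNIFICANT):
    """Give every significant element that lacks a 'sid' a fresh SID."""
    used = [int(el.get("sid")) for el in root.iter() if el.get("sid")]
    next_id = max(used, default=0) + 1
    for el in root.iter():
        if el.tag in significant and el.get("sid") is None:
            # An SID is a zero followed by a positive integer.
            el.set("sid", f"0{next_id}")
            next_id += 1
    return root


root = ET.fromstring(
    '<body><h1 sid="02">Title</h1><p>First.</p><div><p>Second.</p></div></body>'
)
add_sids(root)
print([el.get("sid") for el in root.iter()])
```

One caveat of this sketch: it derives the next SID from the highest SID still present, so if the statement with the highest SID were deleted, that number could be handed out again. A real tool would need to persist the counter alongside the document.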
What makes this task complicated is the algorithm for computing structural location numbers. For Augment documents, the algorithm is straightforward: for statements within a hierarchy, increment the current address element; for every new hierarchy, add a new address element. For XML documents, the algorithm is not so straightforward, mainly because the intended hierarchy is more difficult to define. 5G (032)
Consider the XHTML excerpt in Example 4. 5H (033)
<h1>Hello, World!</h1>
<p>This is an article about greeting the world.</p>
<h1>Greetings in all languages.</h1>
<p>"Hello" in French is "Bonjour."</p>
In a DOM representation of this XML excerpt, all of the elements would be at the same level of the hierarchy. If you were to assign structural location numbers accordingly, you might end up with something like Example 5. 5J (035)
1:Hello, World!
2:This is an article about greeting the world.
3:Greetings in all languages.
4:"Hello" in French is "Bonjour."
However, in this particular document, everything below the h1 element should probably be a subelement of h1, as indicated in Example 6. 5L (037)
1:Hello, World!
  1A:This is an article about greeting the world.
2:Greetings in all languages.
  2A:"Hello" in French is "Bonjour."
What if we were to change the second h1 element into an h2 element? That might indicate a document as shown in Example 7. 5N (039)
1:Hello, World!
  1A:This is an article about greeting the world.
  1B:Greetings in all languages.
    1B1:"Hello" in French is "Bonjour."
In all of these cases, we are guessing the intended semantics of each element. However, these same elements could certainly be used in different ways for different documents. 5P (041)
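One such guess can be written down as a short Python sketch: treat each heading as opening a new level, and nest everything that follows under the most recent heading of lower rank. This is only one possible heuristic, not the algorithm used by Purple or plink; the function names are hypothetical, and the sketch assumes at most 26 siblings on a lettered level.

```python
import re


def format_address(path):
    """Turn a path such as [1, 2, 1] into '1B1' (numbers and letters alternate)."""
    parts = []
    for depth, index in enumerate(path):
        # Assumes index <= 26 on lettered levels.
        parts.append(str(index) if depth % 2 == 0 else chr(ord("A") + index - 1))
    return "".join(parts)


def number_flat_elements(elements):
    """elements: list of (tag, text) pairs in document order.

    Returns a list of (address, text) pairs.
    """
    # Each stack entry is [heading_level, path, number_of_children_so_far].
    stack = [[0, [], 0]]
    result = []
    for tag, text in elements:
        match = re.fullmatch(r"h([1-6])", tag)
        if match:
            level = int(match.group(1))
            # A heading closes every open heading of the same or deeper level.
            while len(stack) > 1 and stack[-1][0] >= level:
                stack.pop()
        parent = stack[-1]
        parent[2] += 1
        path = parent[1] + [parent[2]]
        result.append((format_address(path), text))
        if match:
            stack.append([level, path, 0])
    return result


example6 = [
    ("h1", "Hello, World!"),
    ("p", "This is an article about greeting the world."),
    ("h1", "Greetings in all languages."),
    ("p", '"Hello" in French is "Bonjour."'),
]
# Same content, but with the second heading demoted to h2.
example7 = [example6[0], example6[1], ("h2", example6[2][1]), example6[3]]

for address, text in number_flat_elements(example7):
    print(f"{address}:{text}")
```

Under this heuristic the first input yields the addresses 1, 1A, 2, 2A, and the second yields 1, 1A, 1B, 1B1. A document that uses headings purely for visual effect would be numbered just as confidently, and wrongly.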
XHTML is a markup language designed primarily for representing document formatting, despite the best efforts of the W3C to make it otherwise. The formatting semantics of each element are fairly well defined, but the structural semantics of the elements are highly context-dependent. Consequently, designing an algorithm that "correctly" assigns structural location numbers to XHTML documents is impossible, because what is correct will depend on the context of the document. 5Q (042)
Altheim circumvented this problem by defining three types of structural tags: division, significant, and wrapper. His algorithm for computing structural location numbers is based on these tag types, not on hard-coded tag values. By assigning elements to each of these three types, not only can you change the behavior of plink when parsing XHTML files, you can configure it to parse other XML vocabularies. 5R (043)
I found this approach compelling, and adopted a similar, but more complicated parsing scheme for Purple. While this resulted in a more configurable system, my implementation was still not general enough for all XML vocabularies, and in fact, was not even general enough for XHTML. 5S (044)
Why was this the case? I could configure Purple to use a plink-style algorithm for computing structural location numbers for XHTML files, but I could not configure Purple to add those addresses. Using purple.dtd, adding either an SID or a structural location number is simply a matter of setting the sid or hid (for hierarchical ID) attribute of an element. However, in XHTML, adding addresses to an element means adding named anchors to the beginning of the element's content, and adding the purple numbers to the end of the element's content. 5T (045)
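The XHTML half of this difference can be sketched as follows. This is an illustrative Python fragment, not the code used by Purple or plink; the function name `add_purple_number`, the `class="purple"` attribute, and the choice to make the visible number a link to its own anchor are all assumptions.

```python
import xml.etree.ElementTree as ET


def add_purple_number(element, sid):
    """Insert a named anchor first and a visible address last."""
    anchor = ET.Element("a", {"name": sid})
    # The element's leading text becomes the anchor's tail, so the text
    # still follows the (empty) anchor in document order.
    anchor.tail = element.text
    element.text = None
    element.insert(0, anchor)

    # Separate the visible number from the content with a space.
    last = element[-1]
    last.tail = (last.tail or "") + " "

    label = ET.SubElement(element, "a", {"href": f"#{sid}", "class": "purple"})
    label.text = f"({sid})"
    return element


p = ET.fromstring("<p>This is the first paragraph.</p>")
add_purple_number(p, "06")
print(ET.tostring(p, encoding="unicode"))
```

With purple.dtd, by contrast, the whole operation reduces to a single `element.set("sid", sid)` call, which is why one parser configuration cannot comfortably cover both cases.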
My conclusion was that it was pointless to try to write a single, configurable, generalized parser to handle these two different requirements. It's far simpler to use a different parser for each of these situations, which is one reason why Purple and plink will coexist peacefully. 5U (046)
After both Purple and plink were released, Henry van Eyken, Bootstrap Institute's webmaster, was naturally interested in using one of these programs to add purple numbers to the Bootstrap Institute Web site. I suggested that plink was the better tool for van Eyken's task, because he was working solely with HTML documents. By translating those documents into XHTML -- a task that could be heavily automated using Dave Raggett's tidy -- van Eyken could run plink on all of the translated documents, and his task would be complete. 6A (048)
However, for my personal Web site, http://www.eekim.com/, deciding which tool was most appropriate was not as straightforward. Because the content of my Web site is constantly changing, I found it beneficial to use a template system. This would allow me to easily modify the look-and-feel of the entire site without having to modify every document. My site consists entirely of static HTML pages, automatically generated from templates built using Andy Wardley's Template Toolkit. Template Toolkit processes directives delimited by square brackets and percent signs. For instance, the directive for including a file within a template is "[% INCLUDE file %]". I store the templates in a CVS repository, which allows me to store a complete record of all the content that ever appeared on my Web site. 6B (049)
There were three possible approaches to adding purple numbers to my Web site. First, I could write all of the actual content of my site in XML, using purple.dtd or some other custom-designed vocabulary. I could then generate HTML with purple numbers from that XML, and then include that content into my templates with the INCLUDE directive. The second possibility was to scrap Template Toolkit entirely, and generate the entire content of the site from XML documents using XSLT templates. The third possibility was to run plink on the XHTML files generated by my templates. 6C (050)
In the end, I rejected all three of these approaches. Either of the first two would probably have been ideal, but frankly, when I began tackling this problem, I was still in the process of refactoring my Web site, and did not have a strong enough grasp of the content on my Web site to create a unified set of semantically meaningful tags. I was constantly reinventing XHTML tags -- tags for tables, tags for lists, etc. -- which, of course, was silly. Rather than reinvent an XHTML-like vocabulary, I decided simply to use XHTML. 6D (051)
Also, I was reluctant to replace Template Toolkit with an all-XSLT solution, because Template Toolkit does its job so well. XSLT can be confusing and limiting at the same time, and I felt no need to skirt around its eccentricities when I already had a good solution in place. 6E (052)
The obvious solution was to use plink or a plink-like tool. Rather than use plink, I chose to write a similar tool in Perl called xhtml_purple. This was not a case of not-invented-here syndrome. I needed to make some changes to the software -- things to accommodate my use of Template Toolkit and things that I just wanted to do differently -- and I felt that writing my own code from scratch would be faster than modifying plink's source code. My results justified my rationale. xhtml_purple took me about an hour to write, test, and debug, and it consists of a mere 128 lines of Perl. 7A (054)
xhtml_purple does two things differently from plink. First, it preprocesses the template file to look like well-formed XML, and postprocesses it to remove this extraneous markup. Second, it does not add structural location numbers. 7B (055)
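The preprocessing and postprocessing step might look something like the Python sketch below. How xhtml_purple actually does this is not described here, so everything in the sketch is an assumption: the wrapper element name, the `@@TTn@@` placeholder format, and the idea of hiding each Template Toolkit directive behind a placeholder so that characters such as "<" inside a directive cannot break the XML parser.

```python
import re
import xml.etree.ElementTree as ET

DIRECTIVE = re.compile(r"\[%.*?%\]", re.DOTALL)
ROOT_OPEN, ROOT_CLOSE = "<purple-wrapper>", "</purple-wrapper>"


def preprocess(template):
    """Hide directives behind placeholders and wrap the text in one root."""
    directives = []

    def stash(match):
        directives.append(match.group(0))
        return f"@@TT{len(directives) - 1}@@"

    body = DIRECTIVE.sub(stash, template)
    return ROOT_OPEN + body + ROOT_CLOSE, directives


def postprocess(xml_text, directives):
    """Strip the wrapper and put the original directives back."""
    body = xml_text[len(ROOT_OPEN):-len(ROOT_CLOSE)]
    return re.sub(r"@@TT(\d+)@@", lambda m: directives[int(m.group(1))], body)


template = "[% INCLUDE header %]\n<p>Hello</p>\n[% IF n < 1 %]<p>Hi</p>[% END %]"
xml_text, saved = preprocess(template)
ET.fromstring(xml_text)  # parses: a single root, no stray "<" characters
restored = postprocess(xml_text, saved)
```

The sketch assumes that whatever runs between the two steps leaves the wrapper tags exactly as written at the start and end of the text, and that the template fragment is otherwise well-formed XHTML.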
I excluded the structural location numbers mainly because they don't make much sense on a dynamic Web site with no version control. Granular links to those addresses will eventually break or become incorrect as the site's content evolves. Additionally, I had not consistently used the XHTML tags in a semantically unified way, and so the algorithm for computing structural location numbers would have been different depending on the page. Finally, I didn't have a good solution for adding structural location numbers to non-hierarchical structured data, such as tables. The challenge was not coming up with a scheme, but coming up with the right scheme. 7C (056)
Not having to deal with structural location numbers simplified xhtml_purple enormously. However, there were some other difficulties regarding the purple numbers themselves -- namely, determining which XHTML elements should have the numbers. In an ideal information management system, all available content is addressable. However, purple numbers are not an ideal system. One of their main downsides is that having too many purple numbers on a page can decrease a site's usability, simply because they can be visually distracting. 7D (057)
Not including structural location numbers helped in this regard, because it reduced the amount of extraneous visible content. Selectively using purple numbers for certain elements in certain contexts improved the usability even more. For example, I decided that the header and footer elements of a page did not need purple numbers. This decision could have complicated the implementation of xhtml_purple; for instance, it would have had to distinguish a p element in the header of a page from a p element in the body of the page. Fortunately, xhtml_purple did not have to deal with this, because the headers and footers were hidden away in Template Toolkit directives, and hence, were never seen by xhtml_purple. 7E (058)
I also decided not to add purple numbers to table elements. The problem was that I could not come up with a general algorithm for determining which table elements should have purple numbers. In some cases, I wanted an entire row to be the most granular element, whereas in other cases, I wanted purple numbers for each table cell. There were even some cases where I wanted the entire table itself to have only one purple number. Rather than customize xhtml_purple to assign SIDs to table elements on a case-by-case basis, I chose to ignore them altogether. 7F (059)
Taken at face value, a purple number is simply that -- a purple number on the screen. If all you wanted to do was to add purple numbers to a Web page in a visually pleasing way, the resulting process would not merit any discussion. Underlying these purple numbers, however, is the notion of granular addressability. This is where things start to get interesting -- and complicated. 8A (061)
Purple numbers are a scheme that mostly applies to (X)HTML. When I say that I am adding purple numbers to a document, what I mean is that I am adding named anchors to an HTML element and displaying the addresses in purple at the end of the element. If there is some XML that is used to generate the HTML, what I am doing is attaching addresses to the XML document that will subsequently be converted into HTML purple numbers. 8B (062)
XML doesn't need purple numbers. XML is about structured content, not about visual presentation. What XML does need is granular addressability. XML has this in the form of XPath, which, in my view, obviates the need for structural location numbers in most documents (more on this in a second). However, XML does not, by default, have SIDs, or immutable addresses attached to significant elements. Tools such as Purple and plink that want to support XML documents should, at minimum, add and maintain these SIDs. Both Purple and plink already do this. 8C (063)
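The difference between a location-based XPath address and an SID can be seen in a small Python sketch. The `doc` and `p` element names are hypothetical stand-ins rather than the actual purple.dtd vocabulary, and the expressions use only the XPath subset supported by Python's xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<doc>"
    '<p sid="01">First paragraph.</p>'
    '<p sid="02">Second paragraph.</p>'
    "</doc>"
)

# Two addresses for the same element: one by location, one by SID.
by_position = doc.find("./p[2]")
by_sid = doc.find(".//p[@sid='02']")
same_before = by_position is by_sid

# Insert a new paragraph at the front of the document.
new_p = ET.Element("p", {"sid": "03"})
new_p.text = "A new opening paragraph."
doc.insert(0, new_p)

# The location-based address now selects a different element;
# the SID-based address still selects the original one.
moved = doc.find("./p[2]").get("sid")
stable = doc.find(".//p[@sid='02']").text
print(same_before, moved, stable)
```

The location-based expression plays the role of a structural location number, with the same fragility under edits, while the attribute lookup behaves like a link to an immutable SID.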
With this in mind, I learned a number of lessons in developing Purple and adding purple numbers to my own Web site: 8D (064)
Structural location numbers had a very important role in Engelbart's Augment system, and are still relevant for documents with a clear hierarchical structure. However, they do not make sense for addressing non-hierarchical structured data, such as tables. Additionally, as I explained earlier, there are instances where a document's structure is clearly hierarchical, but the nature of the hierarchy is somewhat ambiguous. 8E1 (066)
Having an addressing scheme based on the location of the content makes sense, but these schemes should depend on the document type. XML documents already have such a scheme in XPath. Unlike structural location numbers, XPath is based on a standard data model for XML documents, and thus has an unambiguous notion of document hierarchy. 8E2 (067)
Do structural location numbers belong on Web sites? For static documents with clear hierarchy, they are still valuable. However, for constantly changing content, I feel that structural location numbers distract more than help. 8E3 (068)
Broken links are caused when documents are deleted or moved. We can solve half of this problem by never deleting documents, which is what a version control system does. Without version control, you cannot have true link integrity. 8F1 (070)
Version control also solves one of the problems with structural location numbers by enabling what I call link evolution. There are times when you want a link to evolve along with a document. For instance, if you have a link to the second paragraph of a document, and that second paragraph becomes the third paragraph, you want the link to point to the third paragraph of the new document. 8F2 (071)
SIDs naturally have this capability, but structural location numbers do not. However, as Engelbart pointed out, humans like using structural location numbers better than SIDs, and so it would be desirable to build this capability into these addresses. This is possible by maintaining a mapping between SIDs and structural location numbers for each version of a document. In the previous example, where you had a link pointing to the second paragraph of the first version of the document, you could evolve that link by determining the SID of that paragraph in the first version, and then figuring out the new structural location number of the paragraph with that SID in the second version. 8F3 (072)
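A minimal sketch of this mapping in Python, assuming each version of a document is available as a dictionary from SID to structural location number. The function name `evolve_link` and the sample data are hypothetical.

```python
def evolve_link(address, old_version, new_version):
    """Translate a structural location number from one version to another.

    Each version is a dict mapping SID -> structural location number.
    Returns None if the address is unknown in the old version or the
    statement no longer exists in the new version.
    """
    sid_for_address = {loc: sid for sid, loc in old_version.items()}
    sid = sid_for_address.get(address)
    if sid is None:
        return None
    return new_version.get(sid)


# Version 1: three top-level paragraphs.
v1 = {"01": "1", "02": "2", "03": "3"}
# Version 2: a new paragraph (SID 04) was inserted before the old second one.
v2 = {"01": "1", "04": "2", "02": "3", "03": "4"}

print(evolve_link("2", v1, v2))
```

Here a link to paragraph "2" of the first version resolves to SID 02 and then to "3" in the second version, so the link follows the paragraph rather than the position. A link that should stay pinned to the old content would simply skip the translation and name the first version explicitly.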
Purple provides limited version control capabilities using a CGI program called dkr. Although I have found this program useful for serving a selected subset of papers, it would not have been a good solution for serving my entire Web site. I would have liked to have added a version control layer using a proper Web application framework that would have allowed the user to retrieve any document in the CVS repository. However, this was not viable for me, because I had limited access to my ISP's Web server, and hence, no way of writing a non-CGI interface between the server and CVS. 8F4 (073)
I have already concluded that SIDs are the most important addressing scheme in a DKR because of their immutability. However, SIDs have some problems of their own. 8G1 (075)
What should the SID be for transient content, such as content dynamically generated by a CGI program? If the content is being retrieved from a database, you could return the SID associated with the data in the database. However, what if the content is computed content as opposed to stored content? Suppose you have a CGI program that asks you to type in a mathematical expression. You type in "1+1", and the program returns "2". Should the "2" have an SID associated with it? 8G2 (076)
What about the SIDs for reused content? On every page on my Web site, I have a copyright notice, which, in fact, is stored in one file and reused multiple times. Suppose I wanted to make that notice addressable. Should the SID reflect the fact that it is transcluded content? 8G3 (077)
I have thoughts on all of these questions, but no firm answers. I did not have to deal directly with any of these questions for the current version of my Web site. However, they are important DKR issues that the OHS developers will eventually need to address. 8G4 (078)