
This article appeared in the November/December 1995 issue of Dr. Dobb's Sourcebook, and is reproduced here with permission from Miller Freeman, Inc. Copyright © 1995 Dr. Dobb's Journal.


Programming CGI in C

Sometimes Perl isn't the best tool for CGI programs

Eugene Eric Kim

Eugene is a history and science concentrator at Harvard University. He can be contacted at eekim@fas.harvard.edu.


The World Wide Web has rapidly become the most popular application on the Internet due to its simplicity and visual appeal. Perhaps its most important feature, however, is its interactivity. The ability to communicate both ways over the Web allows site maintainers to develop sophisticated programs that provide information based on user feedback.

Data is transmitted between the user and the server using a protocol called "Common Gateway Interface" (CGI). Although the CGI protocol is not difficult to understand, it can be intimidating at first, especially for the Web designer spoiled by the simplicity of HyperText Markup Language (HTML), the language used to format content on the Web. Additionally, programmers who are well-versed in CGI prefer to focus on developing the actual application rather than dealing with the internals of the protocol. Consequently, there is a genuine need for useful, efficient, and simple CGI-processing tools.

Such libraries have been developed in several languages, including Perl, C, and C++. In this article, I'll introduce cgihtml, a public-domain C library I wrote that can simplify CGI programming on a UNIX platform. I'll present examples of CGI programs here. The complete code for the library is available electronically; see "Availability," page 3.

CGI Specification

CGI describes the "gateway" for information between the Web application, the server, and the browser; see Figure 1. Every time you select a CGI program from your browser, either by filling out a form and pushing the Submit button or by selecting a link to a CGI program, data is sent to the server. The server then invokes the program and gives it your data in an encoded form. (For a description of how the data is encoded, see Example 1.) The data is either delivered to the standard input (the POST method) or stored in an environment variable (the GET method).

Once you have this data in some usable form, you can manipulate it.
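
As a rough illustration of the two delivery paths, the sketch below fetches the raw, still-encoded input for either method. It is not taken from cgihtml. The function names are hypothetical, and the variable names REQUEST_METHOD, QUERY_STRING, and CONTENT_LENGTH come from the CGI specification rather than from this article.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Return a malloc'd copy of the raw, still-encoded CGI input, or NULL on
 * error or for an unsupported method. The caller frees the result.
 * (Hypothetical helper; not part of cgihtml.) */
char *cgi_raw_input(const char *method, const char *query_string,
                    const char *content_length, FILE *in)
{
    if (method == NULL)
        return NULL;

    if (strcmp(method, "GET") == 0) {
        /* GET: the data arrives in the QUERY_STRING environment variable. */
        const char *src = query_string ? query_string : "";
        char *copy = malloc(strlen(src) + 1);
        if (copy != NULL)
            strcpy(copy, src);
        return copy;
    }

    if (strcmp(method, "POST") == 0) {
        /* POST: CONTENT_LENGTH bytes arrive on the standard input. */
        long len = content_length ? strtol(content_length, NULL, 10) : 0;
        char *buf;
        size_t got;

        if (len < 0 || in == NULL)
            return NULL;
        buf = malloc((size_t)len + 1);
        if (buf == NULL)
            return NULL;
        got = fread(buf, 1, (size_t)len, in);
        buf[got] = '\0';
        return buf;
    }

    return NULL; /* other methods are not handled in this sketch */
}

/* Convenience wrapper that takes its arguments from the real environment. */
char *cgi_raw_input_from_env(void)
{
    return cgi_raw_input(getenv("REQUEST_METHOD"), getenv("QUERY_STRING"),
                         getenv("CONTENT_LENGTH"), stdin);
}
```

A CGI program would normally call cgi_raw_input_from_env() once at startup. Keeping the environment lookups in a thin wrapper makes the core routine easy to exercise outside a Web server.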

Additionally, the server sets several environment variables for the CGI program to use. These environment variables usually contain useful information about the client, the server, and the current request. For example, the environment variable REMOTE_ADDR provides the IP address of the client connected to the server, while the variable HTTP_USER_AGENT defines the type of browser on the client machine. For a complete list of these environment variables, see http://hoohoo.ncsa.uiuc.edu/cgi/.
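
Reading these variables from C takes only a call to getenv(). The following is a minimal sketch. The helper names env_or() and print_client_info() are made up for illustration, and the "(unknown)" fallback is an assumption about how you might want to handle a variable the server did not set.

```c
#include <stdio.h>
#include <stdlib.h>

/* Return the value of an environment variable, or fallback if it is unset.
 * (Hypothetical helper; not part of cgihtml.) */
const char *env_or(const char *name, const char *fallback)
{
    const char *value = getenv(name);
    return value != NULL ? value : fallback;
}

/* Write a short plain-text summary of the client to out. */
void print_client_info(FILE *out)
{
    fprintf(out, "Client address: %s\n", env_or("REMOTE_ADDR", "(unknown)"));
    fprintf(out, "Browser: %s\n", env_or("HTTP_USER_AGENT", "(unknown)"));
}
```

Checking for a missing variable matters in practice, since getenv() returns a null pointer when the program is run by hand from a shell rather than by the server.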

Although a CGI program does not need to receive any input from the browser (for example, when the program is invoked from within an <a href="..."> tag), it is required to send output to the browser. Data sent to standard output is received, interpreted, and displayed by the browser. Two sets of data are sent to the output: HyperText Transfer Protocol (HTTP) headers, and the actual information you wish to transmit.

There are several different HTTP headers that give the browser important information about the data it is about to receive. The most important header, known as "Content-Type:", tells the browser what type of information the server is sending, so the browser can interpret it accordingly. This header is of the form Content-Type: type/subtype, where type/subtype is a valid Multipurpose Internet Mail Extensions (MIME) type. The header is followed by a blank line.

The most common of these headers is Content-Type: text/html, the MIME type for HTML documents. Note that a blank line is required after the Content-Type, even if you are returning a blank page.

Another important header is Location:, which tells the browser to look for a file at a specified location. This is useful for writing scripts to redirect the browser to a different location. For example, to tell the browser to access the file file.html when it calls a CGI program, you need a CGI program that returns Location: file.html followed by a blank line.
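
Both headers can be produced with one fprintf() call each. The sketch below uses hypothetical function names and is not cgihtml's implementation. It assumes you want the blank line emitted together with the header, which suits programs that send only one header.

```c
#include <stdio.h>

/* Write a Content-Type header and the blank line that ends the headers.
 * Returns the number of characters written, as fprintf() does.
 * (Hypothetical helper; not part of cgihtml.) */
int write_content_type(FILE *out, const char *mime_type)
{
    return fprintf(out, "Content-Type: %s\r\n\r\n", mime_type);
}

/* Write a Location header (a redirect) and the terminating blank line. */
int write_location(FILE *out, const char *url)
{
    return fprintf(out, "Location: %s\r\n\r\n", url);
}
```

A program that redirects would call write_location(stdout, "file.html") and exit without printing a body. A program that returns a page would call write_content_type(stdout, "text/html") and then print its HTML.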

One other HTTP header that is often useful is the no parse header (nph), which tells the Web browser that the server does not want to return any information. Nph headers are commonly used in imagemaps. To have a designated portion of an imagemap ignore the user's click, you have that portion pointing to a CGI script that returns an nph header. Nph headers are of the form shown in Example 2, followed by a blank line. You must include a Content-Type header even if you are not planning on returning any data.

To use the CGI protocol, you need a language that can read from the standard input, get environment variables, and write to the standard output. Since practically every computer language in existence can do all three, CGI programs may be written in whichever language suits you best.

C versus Perl

CGI programs must invariably parse plain text; Perl's high-level syntax, flexibility, and text-manipulation routines make it an ideal language in which to program CGI.

However, Perl and other very high-level scripting languages have limitations. One downside is their size. The Perl executable can be as much as ten times larger than CGI C binaries. While some of the CGI libraries for Perl greatly simplify programming, some do so at a cost in server performance. Since most servers fork a separate process every time a CGI program is invoked, overhead can grow rapidly on a high-traffic site with lots of CGI access.

Some Web servers (most notably Netscape and Apache) have their own APIs. These allow you to code your CGI programs as extensions to the server, thus avoiding the overhead created by forking new processes. Communicating with these APIs generally means coding your CGI programs in C.

Many specialized applications come with only C libraries. Additionally, you sometimes may require a high level of control over your program's actions. Only a lower-level language such as C can provide this control for all types of applications.

A CGI Library for C Programs

A properly implemented CGI library in C needs to strike a careful balance between usability and flexibility. My cgihtml library focuses almost entirely on providing routines for the most mundane CGI-specific tasks, such as decoding CGI input. When using cgihtml to code CGI programs that have specialized needs, such as type-conversion or advanced string-parsing routines, you must decide how best to implement these functions. Rather than attempt to completely hide the intricacies of CGI programming, cgihtml tries to complement and assist the programmer's skill.

The parsing routine read_cgi_input() determines whether the input is sent via the POST or GET method and interprets it accordingly, as in Figure 2. Parsed data is placed in a linked list of structures, each consisting of two elements: the name of the entry and its value.

cgihtml provides the cgi_val() routine to easily obtain the value of an element of the linked list, given the name. If you would rather search for elements with a given value, or for elements with a given combination of name and value, you can traverse the list yourself using the provided linked-list routines.
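
To make the idea concrete, here is a minimal sketch of parsing encoded input into a linked list and looking up a value by name. It is not cgihtml's code. The struct and function names are hypothetical, and it assumes standard URL encoding, in which + stands for a space and %XX for the byte with hexadecimal value XX.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* One name/value pair from the form, linked to the next pair. */
struct entry {
    char *name;
    char *value;
    struct entry *next;
};

/* Decode s in place: '+' becomes a space, %XX becomes the byte 0xXX. */
static void url_decode(char *s)
{
    char *out = s;
    while (*s != '\0') {
        if (*s == '+') {
            *out++ = ' ';
            s++;
        } else if (s[0] == '%' && isxdigit((unsigned char)s[1])
                               && isxdigit((unsigned char)s[2])) {
            char hex[3] = { s[1], s[2], '\0' };
            *out++ = (char)strtol(hex, NULL, 16);
            s += 3;
        } else {
            *out++ = *s++;
        }
    }
    *out = '\0';
}

/* Copy the n bytes at s into a fresh, decoded, NUL-terminated string. */
static char *decoded_copy(const char *s, size_t n)
{
    char *copy = malloc(n + 1);
    if (copy == NULL)
        return NULL;
    memcpy(copy, s, n);
    copy[n] = '\0';
    url_decode(copy);
    return copy;
}

/* Split "n1=v1&n2=v2" into a list that keeps the original order.
 * Returns NULL for empty input; stops early if memory runs out. */
struct entry *parse_query(const char *raw)
{
    struct entry *head = NULL, **tail = &head;

    while (*raw != '\0') {
        size_t pair_len = strcspn(raw, "&");
        const char *eq = memchr(raw, '=', pair_len);
        size_t name_len = eq ? (size_t)(eq - raw) : pair_len;
        struct entry *e = malloc(sizeof *e);

        if (e == NULL)
            break;
        e->name = decoded_copy(raw, name_len);
        e->value = eq ? decoded_copy(eq + 1, pair_len - name_len - 1)
                      : decoded_copy("", 0);
        e->next = NULL;
        if (e->name == NULL || e->value == NULL) {
            free(e->name);
            free(e->value);
            free(e);
            break;
        }
        *tail = e;
        tail = &e->next;

        raw += pair_len;
        if (*raw == '&')
            raw++;
    }
    return head;
}

/* Return the value of the first entry called name, or NULL if none. */
const char *entry_value(const struct entry *list, const char *name)
{
    for (; list != NULL; list = list->next)
        if (strcmp(list->name, name) == 0)
            return list->value;
    return NULL;
}

/* Free every node in the list along with its strings. */
void free_entries(struct entry *list)
{
    while (list != NULL) {
        struct entry *next = list->next;
        free(list->name);
        free(list->value);
        free(list);
        list = next;
    }
}
```

Appending at the tail keeps the entries in the order the browser sent them, which is usually the order of the fields in the form.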

All CGI programs must return a header, and most return some HTML as well. The output routines in cgihtml are simply wrapper functions for the appropriate printf() call. These routines simplify outputting HTTP headers and HTML tags, which should encourage you to use proper, well-structured HTML rather than a hacked-together string of tags. Your code's readability will improve as well, since the function names express their purpose better than a printf() call would. For example, the code in Example 3(a), which uses the cgihtml library, is far more readable than the equivalent C code without the library, shown in Example 3(b).

There are also a number of more specialized routines that come with cgihtml to provide additional functionality and security.

Returning Query Results

To demonstrate use of this library, I'll build a rudimentary application: a generic query-results program that returns the names and values of everything entered in any form, such as the one in Example 4. This example form uses the POST method; however, you can use either GET or POST with the query-results program.

The example assumes that the compiled query-results program is in your Web server's cgi-bin directory. The query-results program simply shows the name of each form item and the value entered by the user. First, the code needs to include the appropriate header files, namely cgi-lib.h and html-lib.h. Next, it declares an automatic variable called "items," a linked list that will store the names and values; see Example 5.

Next, the program needs to read the input. Remember that read_cgi_input() understands both the GET and POST methods automatically; you do not need a separate function to handle either case. The next statement in the program uses read_cgi_input() to read the item data into the linked list.

You now have a linked list with all the items entered from the form. You could iterate through the list yourself and print each entry using some of the linked-list routines provided by cgihtml. However, it's simpler to use the print_entries() function, which outputs each name and value using an HTML definition list (the <dl> tag).
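
If you did want to walk the list by hand, the loop might look something like the sketch below. It uses its own small stand-in list type rather than cgihtml's, and the function names are hypothetical. The HTML escaping step is my own addition for this sketch. It keeps user-entered characters such as < from being interpreted as markup.

```c
#include <stdio.h>

/* Minimal stand-in for one parsed form entry (not cgihtml's type). */
struct pair {
    const char *name;
    const char *value;
    const struct pair *next;
};

/* Write s to out, replacing the characters HTML treats specially. */
static void put_html(FILE *out, const char *s)
{
    for (; *s != '\0'; s++) {
        switch (*s) {
        case '<': fputs("&lt;", out);   break;
        case '>': fputs("&gt;", out);   break;
        case '&': fputs("&amp;", out);  break;
        case '"': fputs("&quot;", out); break;
        default:  fputc(*s, out);       break;
        }
    }
}

/* Print the list as an HTML definition list: names in <dt>, values in <dd>.
 * Returns the number of entries written. */
int print_pairs(FILE *out, const struct pair *list)
{
    int count = 0;

    fputs("<dl>\r\n", out);
    for (; list != NULL; list = list->next) {
        fputs("  <dt>", out);
        put_html(out, list->name);
        fputs("</dt>\r\n  <dd>", out);
        put_html(out, list->value);
        fputs("</dd>\r\n", out);
        count++;
    }
    fputs("</dl>\r\n", out);
    return count;
}
```

Passing the output stream as a parameter, instead of always writing to stdout, makes it possible to capture the generated HTML in a file while debugging.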

Before you output your data, you must tell the browser what type of information you are sending by outputting a Content-Type header. This is accomplished here by the html_header() function. Finally, the list is cleared and the memory is freed up before exiting.

You now have a fully functioning CGI program in only a few lines of code. This code can be extended to do many things because each item in the linked list is easily accessible.

Programming Strategies

One of the most important issues related to CGI programming is security. A badly written CGI program can open up your system to anyone smart enough to manipulate it. In general, you should run your Web server as an unprivileged user (usually "nobody") to limit the damage someone could do if he or she were to break in via a CGI script.

Although running the program as an unprivileged user reduces the risk, it does not eliminate it. In CGI C programs, C functions that fork a Bourne shell process (system() or popen(), for example) present a serious potential security hole. If you allow user input into any of these functions without first "escaping" the input (adding a backslash before offending characters), someone malicious can take advantage of your system using special, shell-reserved "metacharacters." For instance, Example 6(a) may seem perfectly safe; it simply opens up a pipe to sendmail. However, since popen() forks a shell, invoking the CGI script with the response in Example 6(b) as a value for "to:" will create the file I_HAVE_ACCESS in the server's /tmp directory. Although this is a relatively harmless example, there are more serious possibilities.

In order to prevent malicious input into system() and related routines, cgihtml comes with an escape_input() function, which merely precedes every shell metacharacter in a string with a backslash. Example 6(c) is a modified, safe version of the code. Now if the user enters the response in Example 6(b), the semicolon will be preceded with a backslash before it is placed in the command string. As Example 6(d) shows, the sanitized popen() command string will simply send mail to three bad addresses rather than allow a user to create files on the server machine from an unauthorized client.
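
The escaping step itself is short. The sketch below shows one way such a routine could work. It is not the source of cgihtml's escape_input(), the name escape_shell() is hypothetical, and the set of metacharacters is an assumption you should review against the shell your system actually uses.

```c
#include <stdlib.h>
#include <string.h>

/* Return a malloc'd copy of s with a backslash in front of each character
 * in the metacharacter set below. Returns NULL if memory runs out.
 * (Hypothetical helper; not cgihtml's escape_input().) */
char *escape_shell(const char *s)
{
    /* Assumed metacharacter set for this sketch. */
    static const char meta[] = "&;`'\"|*?~<>^()[]{}$\\\n";
    /* Worst case: every character needs a backslash in front of it. */
    char *out = malloc(2 * strlen(s) + 1);
    char *p = out;

    if (out == NULL)
        return NULL;
    for (; *s != '\0'; s++) {
        if (strchr(meta, *s) != NULL)
            *p++ = '\\';
        *p++ = *s;
    }
    *p = '\0';
    return out;
}
```

Calling escape_shell("; touch /tmp/I_HAVE_ACCESS") yields the string shown in Example 6(d). Where your program allows it, avoiding the shell altogether is more robust than escaping, because no list of metacharacters has to be kept complete.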

Another common programming challenge occurs when you press the Stop button on your browser while a CGI script is still running. Although most servers receive a signal stating that the client has closed its connection, they rarely bother passing it on to the CGI program. Additionally, if your experimental CGI program has a bug and goes into an infinite loop, pressing the Stop button will not break you out of it.

The solution is to set an alarm that goes off well after the program should have finished and exited cleanly. If the alarm sounds, the program probably has a bug, so trap that signal and call an appropriate exit function to deal with it. cgihtml comes with its own primitive die() function, which sends an error message to the Web browser, but you are encouraged to write your own die() to fit your needs. In C, this looks like Example 7. If this program is still running 30 seconds after launch, it will automatically print an error message to the Web browser and quit.

One other task CGI programmers often face is content negotiation. The large number of existing browsers, each supporting different features, often frustrates the Web-page designer, who must design pages that look good on any browser. For example, an imagemap may look fantastic on a graphical browser with a T1 connection but is utterly useless to those with text browsers or slow Internet connections.

One way to deal with this dilemma is to use CGI scripts to determine what the browser is capable of displaying, then send the appropriate HTML file. There are several variables that you can check for different kinds of content negotiation. cgihtml comes with the function accept_image(), which checks the HTTP_ACCEPT environment variable to see whether the browser can view inline images. Other functions could be written to check other environment variables; for example, a check of HTTP_USER_AGENT could be used to send pages that use the special features of particular browsers.

Example 8 is a CGI program that provides content negotiation. It assumes you have previously designed two HTML files: a very graphical one (index-img.html) and another that is text only (index-txt.html). When accessed, this program sends a graphical page to graphical browsers and a text-only page to text browsers.
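
The test behind this kind of negotiation can be as simple as a substring search. The sketch below is a guess at the general approach, not cgihtml's accept_image() source. The function names are hypothetical, and it assumes a browser that can show images lists at least one type beginning with "image/" in HTTP_ACCEPT.

```c
#include <stddef.h>
#include <string.h>

/* Return 1 if an Accept value lists any image type, else 0.
 * A null pointer (variable not set) counts as "no images".
 * (Hypothetical helper; not cgihtml's accept_image().) */
int accepts_images(const char *accept)
{
    return accept != NULL && strstr(accept, "image/") != NULL;
}

/* Pick which of the two prepared pages to send for this Accept value. */
const char *choose_page(const char *accept)
{
    return accepts_images(accept) ? "/index-img.html" : "/index-txt.html";
}
```

In a CGI program, the argument would come from getenv("HTTP_ACCEPT"), and the chosen path would then be sent to the browser in a Location header.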

Conclusion

Although Perl is currently the language of choice among Web programmers, increased server strain will provide a greater incentive for Web maintainers to write their code in lower-level, more-efficient languages such as C. As cgihtml shows, well-written libraries can simplify CGI programming in C without restricting C's flexibility and power.


Examples

Example 1: The encoding scheme for CGI input. Names and values are separated by =, records are separated by &, spaces are replaced by +, and special characters are replaced by % followed by their two-digit hexadecimal code.

name1=value1a+value1b&name2=value2&name3=value3 ...

Example 2: A no parse header that might be returned by an imagemap section to ignore the user's mouse click

HTTP/1.0 204 No Response
Content-Type: text/plain

Example 3: (a) Code that uses the cgihtml library; (b) equivalent code without the cgihtml library.


(a)
html_header();
html_begin("HTML Page");
h1("HTML Page");
printf("<p>This is a sample HTML page.\r\n");
html_end();

(b)
printf("Content-Type: text/html\r\n\r\n");
printf("<html> <head>\r\n");
printf("<title>HTML Page</title>\r\n");
printf("</head>\r\n<body>\r\n");
printf("<h1>HTML Page</h1>\r\n");
printf("<p>This is a sample HTML page.\r\n");
printf("</body></html>\r\n");

Example 4: A sample HTML form that submits its data to the query-results CGI program.


<form method=POST action="/cgi-bin/query-results">
<p>Name:   <input type=text name="name">
<p>Age:    <input type=text name="age">
<p>E-mail: <input type=text name="email">
<p><input type=submit>
</form>

Example 5: The query-results program, which reads data from a form and returns a page of data values


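#include <stdio.h>
#include <stdlib.h>
#include "cgi-lib.h"
#include "html-lib.h"

int main()
{
       llist items;

       read_cgi_input(&items);        /* parse GET or POST data into the list */
       html_header();                 /* Content-Type header plus blank line */
       html_begin("Query Results");
       h1("Query Results");
       print_entries(items);          /* each name and value in a <dl> list */
       html_end();
       list_clear(&items);            /* free the list before exiting */
       exit(0);
}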

Example 6: (a) Code that has a security hole; (b) a user can breach security via shell metacharacters; (c) modified program that escapes the input string; (d) the popen() call rendered harmless.


(a) llist items;
    char command[1024];
    FILE *mail;
    read_cgi_input(&items);
    /* user input reaches the shell unescaped: this is the hole */
    snprintf(command, sizeof(command), "/usr/lib/sendmail %s",
             cgi_val(items,"to"));
    mail = popen(command, "w");

(b) ; touch /tmp/I_HAVE_ACCESS

(c) llist items;
    char command[1024];
    FILE *mail;
    read_cgi_input(&items);
    /* backslash-escape shell metacharacters before building the command */
    snprintf(command, sizeof(command), "/usr/lib/sendmail %s",
             escape_input(cgi_val(items,"to")));
    mail = popen(command, "w");

(d) /usr/lib/sendmail \; touch /tmp/I_HAVE_ACCESS

Example 7: Program with a built-in watchdog timer.


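#include <signal.h>
#include <unistd.h>
#include "cgi-lib.h"

int main()
{
  signal(SIGALRM, die);  /* call die() if the alarm goes off */
  alarm(30);             /* give the program 30 seconds to finish */
  while (1) ;            /* simulated bug: an infinite loop */
}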

Example 8: Program that sends an HTML page tailored to the type of browser.


#include "cgi-lib.h"
#include "html-lib.h"

int main()
{
  if (accept_image())                  /* browser accepts inline images */
    show_html_page("/index-img.html");
  else
    show_html_page("/index-txt.html");
  return 0;
}

Figures

Figure 1: Data flow between the browser, server, and CGI program.

Figure 2: The read_cgi_input() function parses the raw data and places the entries in a linked list.