The Building of TrainingPages

Background

TrainingPages (www.trainingpages.co.uk) was an idea that popped up over a few beers one night when the technical department set off to wash away the dust and sweat of a hard day pounding the keyboard.

A lot of conflicting requirements got thrown into the mixing-bowl and what emerged surprised us all. Most surprising of all was that the ‘brilliant’ ideas of the night before (and we're never short of them) still made sense in the cold light of the morning-after. A rare thing indeed, considering the madcap ideas that usually get aired on one of those nights.

This article is about the ideas, the plans, the way we set about building a dynamic data-driven website, the pitfalls and some of the dead-ends that we found on the way. We think — hope — that there'll be something useful in here for anyone wanting to do something similar. The result for us is a framework that is already proving its value in e-commerce applications and proof, if any more were needed, that Linux and ‘free’ software is a great choice for this kind of job.

The Cunning Plan

The Friday night celebrations have never lacked imagination; it's the judgement, and any alignment with a recognisable version of reality, that have mostly been conspicuous by their absence. This has, naturally, never dampened the spirits at the time. Once again we came to find ourselves sitting around and pondering the philosophical issues of the moment, when the topic came around to why most websites were so dull and boring. Our gang were convinced that they could do much better and started listing how.

Once the ‘what’ was dealt with, the assumption of a compelling commercial case was exhaustively analysed, even continuing all the way to the curry-house. The conversation would normally have been reduced to grunt-level by then, so it did seem that there was the germ of a good idea in it all.

After chewing the basics over for a while, we couldn't see much wrong with the developing plan and it was possible to make a commercial case for doing it, so we set about refining the ideas.

There wasn't — and isn't — a single compelling argument in favour of the plan: it's more a combination of things that add up to make sense as a whole, so they're listed below in no specific order, just outlining each facet as it occurred to us.

At the time we were thinking about it, the industry was having a love-affair with ‘portal’ websites. These are supposed to bring lots of things together and encourage high levels of traffic, so selling advertising and generally being A Good Thing. We were mildly sceptical, but it shaded our view. What really caused, and still causes, frustration was having to scour dozens of websites to find information about a particular product or service. Each commercial operator has their own website, with different layouts, different levels of information and usually without decent searching or navigational tools. If you're looking to buy something — a holiday, a flight, a used car or whatever — what you actually want is a single point of contact, where all the information is in one place, in a common format and easily navigated around. We really couldn't find many examples of this done well. At the time, the airlines all had their own sites, selling their own fares, but it was near impossible to find it all consolidated so you could search for lowest fare / most convenient connections / etc. We wanted to show that there's a better way, something like an online directory, rather than a heap of different catalogues from single suppliers. At the time of writing there are still very few examples that we think do it right, but we're clearly no longer alone. After all, it's hardly a novel idea.

We were also frustrated by the low-tech crud that passes for the average website. Most websites are still pretty dire, typically full of pretty pictures but seriously lacking in facts and information. We reckon this is mainly because they're still built by design companies used to working with paper who neither know nor will thank you for telling them that these things can be live, dynamic and filled with valuable facts. Considering how easy it is to create tailored websites where the pages are aimed at helping the viewer rather than easing life for the designer or maintainer, it's a disgrace that the world still tolerates anything else, but it clearly does. We were certain we could do much better.

A fair amount of our business is training. We enjoy it — our team contains talented educators and we have excellent training materials for our core subjects. We'd like to expand our market there, because when it's done well it's great for the company. It is financially rewarding, gives us the chance to meet lots of people face-to-face and so broaden our own experience, plus it gets us out of the office and sometimes even to interesting places. Our problem is that it's a constant battle to find new customers and we're specialised to the point where you don't get a lot of repeat business. An easy way to market ourselves would be extremely welcome. Our Cunning Plan seemed to have a lot to offer along these lines too.

So, the Cunning Plan was to bring all of these ideas together: a training-oriented database, organised totally around live data, presented in the form of a web search engine with shades of Yahoo and borrowing the better ideas that we could pick up or think up for ourselves. With luck it would prove to be a highly useful resource and might one day become a revenue generator in its own right, but in any case we could point to it as a demonstrator of our capabilities and we could get the ‘why does nobody do it right?’ frustration out of our system.

The Details

Having worked in professional training for many years, we knew quite a lot about how that market operates. Most training companies send out mountains of brochures with which they bombard likely purchasers. They have to advertise (bemoaning the expense) to try to find new customers, whilst in the main realising that it's not hugely effective. Despite best efforts to do otherwise, the companies get the majority of their business from existing customers for the simple reason that the customers prefer to work that way. For most purchasers of training, it's more effort to locate a new source and try it out than to go with the tried-and-trusted supplier you've worked with for years. The phrase “better the devil you know” springs to mind, though that would be unfair, as of course the bulk of established trainers provide good levels of service at broadly similar prices.

At the periphery it's difficult: though established companies have established customers and well-understood training curricula, technology changes (partly through need and partly through fashion). Odd niches spring up, specialist areas have always existed where the volume isn't interesting to the large players… how do the suppliers and the customers in these cases meet up?

The Internet could be one way that customers can search for suppliers, but the search engines are too general and it's particularly hard to focus the search to one area. In the UK, it's not very interesting to find Perl training suppliers in New Zealand. Our idea of a UK-only database makes a lot of sense in this case.

We didn't want to spend our lives maintaining and updating the data, so we concluded that the best thing to do was to provide sign-up and maintenance forms so that anyone can ‘join’ the database, then post and edit details of their own products. The most visible examples of this style of website are the classified ad pages (for example, classifieds2000.net) and sites in the general mould of Hotmail, GeoCities and their ilk.

We wanted to go a stage further than just this. It's strange but true that there is a dearth of market research data available for the professional training market in the UK. Training companies may well know to the last fraction how well their own courses are selling and what the sales trends over time may be, but they usually have no clue at all about their competitors. So, a highly successful company might completely miss the fact that a brand-new market is springing up around them unless they listen very carefully to their customers' more unusual requests. Even if those requests do get noted, it's very hard to substantiate the size of the market and the overall level of interest.

We felt that a valuable feature of an intelligent website would be the ability to track market trends based on click patterns and then to present that information analytically for those who are interested. Frankly, we thought we'd be able to charge good money for that information once word got around, so we designed the database with that in mind. We are convinced that this will be a common feature of all major sites before long, but we don't know of many that do it yet.

We chatted, plotted, schemed and hatched up various ways of implementing our thoughts. It wasn't all that difficult; the components and most of the approach seemed simple and obvious, so long before paralysis by analysis set in, we simply rolled up our sleeves and started hacking away.

The Basics

The basic components were obvious. There was no budget, so the software had to be free. We needed to prove that WE could do it, so off-the-shelf packages weren't really an option. It had to fulfil our unstated belief that what's now called ‘open source’ software is not just a match for commercial alternatives, but in many cases a hands-down winner. There was no contest: it was going to be Linux for the operating system, Apache as the webserver and then a suitable database and a mixture of whatever seemed best to write the database queries and presentation logic in.

Linux was easy. We'd been using it for long enough and since it never crashes, takes next to no maintenance and runs on low-spec hardware as well as commercial systems do on high-end servers, it's hard to see how anyone with the normal number of brain cells could see much else as an alternative, apart from the other free Unix-alikes.

Apache calls for ‘ditto’. Its flexibility of configuration and use goes way beyond anything we would ask it to do and whilst its performance isn't at the ‘screaming’ end of the scale, raw performance was the last thing on our minds anyway. We were thinking in terms of hits per minute, not thousands per second.

The database called for at least two seconds' worth of thought. The contenders were mySQL and Postgres. A straw poll around the table demonstrated that we had in-house knowledge of mySQL and none of Postgres, plus the view that mySQL was more than adequate for the job. So that was that decision made.

The choice of scripting and coding languages was trickier. The ‘obvious’ contender for data-oriented Linux-based websites is PHP. In all honesty, at the time we didn't know enough about PHP to make a decision. Now we're more familiar with it, we might rethink some of our views, though they weren't too far off the mark at the time. Our reading of the situation was that it works well for pages where HTML has the greater part of the job and the data management component has the lesser. There will certainly be people who disagree, but that was our understanding at the time. We would almost certainly use it now in place of our template-tag replacement language (of which more later).

PHP embeds the scripting into the HTML pages, which are interpreted by the webserver as they are processed for delivery to the user's browser. PHP is available as a module for Apache (though the module doesn't have to be used, it runs standalone too) which is very useful for high-performance sites. There are thousands of websites built using it and it clearly works extremely well. The scripting language resembles Perl and there is a rich set of very good ideas embodied in the language and the implementation.

For good or ill, we chose NOT to use PHP, but to go down a different route. This can probably be partly laid at the door of ignorance, partly because we started with small plans that grew and were too lazy to learn PHP before starting to code, and partly because our instincts leant away from it for a project of this size. PHP has grown in stature since we took that decision, and if we knew then what we know now, we might have made a different decision.

The instinct part was because we wanted to build a framework that was applicable to more than just the training website. We felt this site would be mostly logic (i.e. programming), not presentation (HTML). We also wanted to be able to offer it in various languages, so didn't think that embedding the code in the pages was the right way around; instead we turned it on its head and decided to separate the code and the HTML as rigorously as we could. Whether right or wrong, that was the path we took.

When the project was mooted the consensus was that Perl would be a fine tool to use. The crude prototype was given no time budget and was going to be done as a spare time (i.e. during the hours when normal people would be sleeping or having a life) item. It's embarrassing to have to reveal the truth, but in the very early stages, we had a problem getting the Perl/mySQL interface going, so as a short-term expedient started off with C++, intending to make a break later. Anyone with experience of typical software projects will know exactly what happened next: by the time we fixed the Perl problem, the C++ bit was twitching happily on the slab and grew rapidly. Before we knew it, there was no turning back. The upshot is that we now have 11,000 lines of C++ in the delivered code and 350 of Perl. Quite a lot of the C++ is our own reusable libraries for string and object handling, but the effect of the early decision remains with us to this day. We're happy with the result, but it's hard to answer the question when people ask what fundamental principle made us choose C++. The answer is mostly ‘we just did’.

Separating the layout and the code was a GOOD IDEA. We are very pleased with the result. It's not perfect, but it works very well for us. Similar ideas exist in some commercial products, and they do it better, but since the total cost to us was about half a day of coding, and we can alter it if we don't like it, we have no plans for change.

To see how it works, we need to look at the broad sweep.

The Broad Sweep — How The Site Works

Most of the pages that typical visitors will see are generated entirely by CGI programs. There are some ‘static’ information pages (and a whole section of the site for the use of search engines) that work differently, but it's the CGI that is the main thrust.

If you aren't familiar with CGI programs, a brief outline will help. We've configured our Apache web server so that requests for pages with names ending in .html or .htm are delivered as normal web pages: the contents of those pages are copied straight out to the user's browser with no changes. Naturally, they will have to contain HTML just like any other standard web page. If the page name ends with .cgi, something different happens — the corresponding file on our server is treated not as data but as a program and the server runs that program, taking the output of the program and passing that back to the user's browser. Making this work is well documented in the Apache literature and involves nothing more than a few small changes to the httpd.conf and srm.conf configuration files. It's a minor change. We've also arranged for requests for the home directory (i.e. a URL of just www.trainingpages.co.uk with no ‘/’ or filename) to invoke the program index.cgi, a one-line modification over the standard setup.

CGI sounds forbidding if you don't know what it's about, but it's nothing more than a simple way of passing data into those .cgi programs. When the ‘submit’ button on a web page is pressed, any data in the form is encoded and then sent back to the server along with the URL of the program to handle the form: the server finds the program, sets it running and passes in the encoded form data (plus a lot of other stuff). Much the same happens with ordinary links: something like

http://www.trainingpages.co.uk/blah.cgi?a=b&c=d

invokes ‘blah.cgi’ and passes into it encoded data which basically says that element ‘a’ should be set to the value ‘b’ and element ‘c’ to ‘d’.
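
To make that concrete, here is a minimal sketch (not our production code; the helper names are invented for the example) of how a query string like the one above can be decoded into name/value pairs:

#include <iostream>
#include <map>
#include <string>

// Decode '+' and %xx escapes as used in URL-encoded form data.
static std::string url_decode(const std::string &in)
{
    std::string out;
    for (std::string::size_type i = 0; i < in.size(); ++i) {
        if (in[i] == '+') {
            out += ' ';
        } else if (in[i] == '%' && i + 2 < in.size()) {
            out += static_cast<char>(std::stoi(in.substr(i + 1, 2), 0, 16));
            i += 2;
        } else {
            out += in[i];
        }
    }
    return out;
}

// Split "a=b&c=d" into a map of decoded names and values.
static std::map<std::string, std::string> parse_query(const std::string &qs)
{
    std::map<std::string, std::string> form;
    std::string::size_type start = 0;
    while (start < qs.size()) {
        std::string::size_type amp = qs.find('&', start);
        if (amp == std::string::npos)
            amp = qs.size();
        std::string pair = qs.substr(start, amp - start);
        std::string::size_type eq = pair.find('=');
        if (eq != std::string::npos)
            form[url_decode(pair.substr(0, eq))] = url_decode(pair.substr(eq + 1));
        start = amp + 1;
    }
    return form;
}

int main()
{
    std::map<std::string, std::string> form = parse_query("a=b&c=d");
    std::cout << "a is " << form["a"] << ", c is " << form["c"] << "\n";
    return 0;
}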

CGI is basic, reasonably flexible and adequate for many web-based interactions. Its drawbacks are well-known but they didn't affect us too badly. It would be nice if more intelligence could be given to the browser, so that mis-filled forms are detected earlier and we didn't have to spend so much time putting in the error-handling and reporting; the web-form/CGI model of working is clunky at best. We put some effort into adding Javascript intelligence to some of the pages to improve the checking, which did improve the experience that the users get if they mis-fill a form, but not every browser supports Javascript and as a result we still have to provide all of the error handling at the server end.

Adding Javascript proved not to be a great plan. The site had been working well before we added it and immediately after we'd done it we started to see evidence of weird errors and failures with the forms that were being sent back. Taking the Javascript out stopped it immediately and the conclusion was that for the small benefit it brings, the weird problems are a poor payoff. We gave up hunting for the cause of the problem and just accepted the cure.

We started off by having a separate program for each interaction, with names like index.cgi, get_cats.cgi, show_course_details.cgi and numerous others. By about the third one we realised that this wasn't very efficient. Each program contains 99% the same stuff and about 1% that is actually specific to the query it's going to execute, so we coalesced them all into one big chunk and used the program name (available through its argument list) to select the action. We avoided having multiple copies by creating symbolic links with different names that all pointed to the same copy of the program.

The symbolic link tactic worked for a while, but as the numbers of different names increased they became a nuisance to manage. The development code was running on one system and the production version on another with a very simple mechanism used to copy the stuff over — so simple it didn't handle the links well — and we were too idle to fix it. The solution that finally emerged as the most popular was to add a hidden ‘action’ field to each form and have that drive the selection of the code to execute in the monster combined program.
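
A stripped-down sketch of the ‘action’ dispatch idea looks something like this; the handler names are made up for the example and the real program obviously does far more per action:

#include <iostream>
#include <map>
#include <string>

// Hypothetical handlers, one per form action.
static void show_home()           { std::cout << "home page\n"; }
static void show_categories()     { std::cout << "category listing\n"; }
static void show_course_details() { std::cout << "course details\n"; }

// Dispatch on the hidden 'action' field rather than on the program name.
static void dispatch(const std::map<std::string, std::string> &form)
{
    typedef void (*handler)();
    std::map<std::string, handler> actions;
    actions["index"] = show_home;
    actions["get_cats"] = show_categories;
    actions["show_course_details"] = show_course_details;

    std::map<std::string, std::string>::const_iterator it = form.find("action");
    std::string action = (it == form.end()) ? "index" : it->second;

    std::map<std::string, handler>::const_iterator h = actions.find(action);
    if (h != actions.end())
        h->second();
    else
        std::cout << "unknown action: " << action << "\n";
}

int main()
{
    std::map<std::string, std::string> form;
    form["action"] = "get_cats";       // as if the hidden field had been submitted
    dispatch(form);
    return 0;
}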

None of this was particularly difficult to do and the final tactic evolved over time. We never did get around to picking a single standard, so the live site still shows the mixture of approaches.

Quite early on we had to come up with our way of generating the HTML to send back to the user's browser. We decided to use ‘template’ files which contained the framework of the HTML, but with tags inserted to show where content needed to be replaced by the results of the database lookup. There are lots of commercial tools which use a similar approach but we didn't think it was worth the effort of using one when we could write it ourselves in a few hours, with the bonus of being able to make it do exactly what we wanted. The features we needed got added as and when we found them, but overall it's still only a small piece of code that does it.

The tag replacement language in the template files has proved to be a great success. By far the greatest part of the HTML lives in the templates, allowing the C++ code to concentrate on the application logic. The templates can easily be modified by anyone who knows HTML and spends a few minutes learning what the other tags do. The whole look and feel of the site can be changed without needing to change the C++. Our separation of layout from content isn't perfect but it has helped a lot. The C++ still does have to produce a limited amount of HTML, especially for tabular output, and if we could stir ourselves to make the effort, that would be the next stage in enhancing the tag language. Here is a brief description of what it looks like.

Most special tags start with ‘<%’. A tag consisting simply of <%anything> is replaced by the text associated with ‘anything’, i.e. the value of the variable ‘anything’. It's possible to assign values to these tags either in the template files or in the C++ code. Setting them in the template files is very helpful in parameterizing the overall look and styling: <%tfont=<font face="arial, helvetica" size="+1" color="#006666">>

then later… <%tfont>Here is some text in the special font</font>

so if we ever want to change the appearance of the ‘tfont’ style, we just change the assignment to tfont in a single place. This really can be a single place, since the tag language supports file inclusion with <%include stddefs.tpl> being found at the head of most template files. The stddefs.tpl file contains the standard definitions of fonts and colours for all of the pages. Change them there, and the entire site changes immediately.

The tag language is not very clever, but it does what we need. Adding conditional sections to the template files helped us to control another problem, the proliferation of templated pages. Now there are many fewer template files since all of the pages in a similar family are written along the lines of

<%include stddefs.tpl>
... standard head section ...
... preamble to standard body section ...
<%?pagetype_1>
... stuff for page type 1 ...
<%/pagetype_1>
<%?pagetype_2>
... stuff for page type 2 ...
<%/pagetype_2>
... standard end of page stuff ...

A tag like <%?xxx> causes the following lines, up to the following <%/xxx> to be processed only if the variable ‘xxx’ is set. The two conditional markers themselves are not copied through. <%!xxx> is the inverse, giving us an if-not-set capability and allowing us (if you think about it for a moment) to dispense with the need for an ‘else’ clause:

<%?xxx>
stuff
<%/xxx>
<%!xxx>
essentially the ‘else’ part
<%/xxx>

The entire tag replacement language is implemented in less than 250 lines of ugly C++.
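
To give a flavour of what those lines do, here is a heavily simplified sketch of a tag expander with the conditional handling described above. It is not our real code and the names are invented, but the shape is much the same:

#include <iostream>
#include <map>
#include <string>
#include <vector>

typedef std::map<std::string, std::string> TagMap;

// Replace every <%name> in a line with its value, rescanning so that
// values may themselves contain further tags.
static std::string expand(std::string line, const TagMap &tags)
{
    std::string::size_type pos;
    while ((pos = line.find("<%")) != std::string::npos) {
        std::string::size_type end = line.find('>', pos);
        if (end == std::string::npos)
            break;
        std::string name = line.substr(pos + 2, end - pos - 2);
        TagMap::const_iterator it = tags.find(name);
        std::string value = (it == tags.end()) ? "" : it->second;
        line.replace(pos, end - pos + 1, value);
    }
    return line;
}

// Process a template: <%?x>...<%/x> is kept only if x is set,
// <%!x>...<%/x> only if it is not; everything else is expanded.
static void process(const std::vector<std::string> &tpl, const TagMap &tags)
{
    bool skipping = false;
    std::string closing;
    for (std::vector<std::string>::const_iterator l = tpl.begin(); l != tpl.end(); ++l) {
        const std::string &line = *l;
        if (skipping) {
            if (line == closing)
                skipping = false;
            continue;
        }
        if (line.compare(0, 3, "<%?") == 0 || line.compare(0, 3, "<%!") == 0) {
            std::string name = line.substr(3, line.size() - 4);
            bool set = tags.count(name) != 0;
            bool want = (line[2] == '?') ? set : !set;
            if (!want) {
                skipping = true;
                closing = "<%/" + name + ">";
            }
            continue;           // conditional markers are never copied through
        }
        if (line.compare(0, 3, "<%/") == 0)
            continue;
        std::cout << expand(line, tags) << "\n";
    }
}

int main()
{
    TagMap tags;
    tags["tfont"] = "<font face=\"arial, helvetica\">";
    tags["pagetype_1"] = "yes";

    std::vector<std::string> tpl;
    tpl.push_back("<%tfont>Common heading</font>");
    tpl.push_back("<%?pagetype_1>");
    tpl.push_back("Only shown for page type 1");
    tpl.push_back("<%/pagetype_1>");
    tpl.push_back("<%!pagetype_2>");
    tpl.push_back("Shown because pagetype_2 is NOT set");
    tpl.push_back("<%/pagetype_2>");
    process(tpl, tags);
    return 0;
}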

In the C++ code, there is a single (global) instance of an object whose job is to provide an interface to the tag language. It's implemented by a class that we called ‘htpl’ (from HTML template) and its name is ‘Mytemplate’, which may well help to confuse anyone who knows about C++ templates (it's nothing to do with them). When the program starts running, after figuring out what it's going to do, the appropriate template file name is plugged into the Mytemplate object.

The execution of the program will usually involve a number of database queries which then get packaged up into textual results. These are pieced together as strings in the C++ and eventually inserted into the template object using an array-like syntax; the index of the array is another string which matches a template tag name. The C++ code has lots of lines looking like these:

        Mytemplate["cat_name"] = cat_name;
        Mytemplate["title"] = "Training Pages : <%cat_name> Category";

When the program exits, the Mytemplate object is destroyed. In its destructor, which C++ calls just before it finally gets destroyed, the code to process the template file is called. The lines are scanned looking for tags to replace and each time a tag like <%title> is seen, the corresponding string in Mytemplate is used to replace it. Each time a tag is replaced, the line is rescanned for more replacements so we can use tags which reference others. This is important for the parameterized colours and fonts as well as being generally useful.

We have a fetish about producing correct HTML. It's hard to guarantee that we manage it once you take into account the various nested files and conditional sections that we use, so we apply a belt-and-braces check. The code which does the tag replacement opens a Unix pipe to a shell script called ‘doweblint’. This script acts as a wrapper to the popular ‘weblint’ Perl program which checks HTML for validity. If the weblint tool reports an error, the offending page is emailed together with the error message to our development team. This causes a rapid response to fix the error because nobody likes to have their mailbox filled up with whining messages from weblint.
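
The mechanics are nothing special. The sketch below shows the general idea, assuming the doweblint wrapper lives in the current directory; the real code also mails the offending page and weblint's complaints to the development team:

#include <stdio.h>
#include <string>

// Pipe a generated page through the 'doweblint' wrapper script.
// Returns true if the script exited with status zero, i.e. weblint was happy.
static bool check_page(const std::string &html)
{
    FILE *lint = popen("./doweblint", "w");
    if (lint == 0)
        return false;                       // couldn't start the checker
    fwrite(html.data(), 1, html.size(), lint);
    int status = pclose(lint);              // wait for the script to finish
    return status == 0;
}

int main()
{
    std::string page = "<html><head><title>t</title></head><body>hello</body></html>";
    if (!check_page(page))
        fprintf(stderr, "weblint reported a problem\n");
    return 0;
}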

Weblint is not the fastest program in the world but it takes less than a second for it to check pages the size of the ones that we generate. If we started to get high levels of traffic through the site then we'd probably have to use a different strategy but it's nowhere near being a significant bottleneck yet.

It finds problems very quickly!

The Database

We chose mySQL. It's free for use so long as you install it yourself (an interesting licence condition) and costs $100 in one-off quantities if you sell it pre-installed. It's fast, simple and it works. It provides a reasonable implementation of Standard SQL with one rather irritating restriction: currently it doesn't support nested SELECT statements. If you need high-level database support like transactions with commit and rollback, or the ability to checkpoint the database and then replay a transaction log, it's going to disappoint you. Apart from that, we haven't had the slightest problem with it.

mySQL works in client-server mode. Applications make a network connection to the server and all of the queries and results are communicated through the connection. There are numerous drivers available for this interface, including ODBC, JDBC, Perl DBI and a C-language library.

We cooked up an object-like interface to the C library in C++ and use that. The CGI program creates a single Connection object, then repeatedly uses that to run Query objects. Here's a sample of the code from one of the many test and debugging proglets:

// Connect to database 'training' on system 'landlord' with user 'fred'
// and password 'fredpass'
Connection c("training", "landlord", "fred", "fredpass");
if(!c){
    cout << "Failed to open database: reason" <<
            c.error() << endl;
    return 0;
}
Query cq(c, "select * from company");
if(cq){
    array_of_String aos;
    cq.fetch_row(aos);
    cout << "First column value is" <<  aos[0] << endl;
}

If a query succeeds we retrieve a row of the resultset into an array_of_String object and then access each field by indexing into the array. This may not be the smartest or most efficient way of using the C interface, but it works well for us and we've got no plans to change.
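
For anyone curious about what such a wrapper involves, here is a minimal sketch built directly on the mySQL C library. It is not our real Connection and Query code, just an illustration of the object-like approach:

#include <iostream>
#include <string>
#include <mysql/mysql.h>

// A minimal, illustrative object-like wrapper over the mySQL C library.
class Connection {
public:
    Connection(const std::string &db, const std::string &host,
               const std::string &user, const std::string &pass)
    {
        mysql_init(&handle_);
        ok_ = mysql_real_connect(&handle_, host.c_str(), user.c_str(),
                                 pass.c_str(), db.c_str(), 0, 0, 0) != 0;
    }
    ~Connection() { mysql_close(&handle_); }
    operator bool() const { return ok_; }
    const char *error() { return mysql_error(&handle_); }
    MYSQL *handle() { return &handle_; }
private:
    MYSQL handle_;
    bool ok_;
};

class Query {
public:
    Query(Connection &c, const std::string &sql) : result_(0)
    {
        if (mysql_query(c.handle(), sql.c_str()) == 0)
            result_ = mysql_store_result(c.handle());
    }
    ~Query() { if (result_) mysql_free_result(result_); }
    operator bool() const { return result_ != 0; }
    // Fetch the next row; returns false when the resultset is exhausted.
    bool fetch_row(MYSQL_ROW &row)
    {
        row = mysql_fetch_row(result_);
        return row != 0;
    }
private:
    MYSQL_RES *result_;
};

int main()
{
    Connection c("training", "landlord", "fred", "fredpass");
    if (!c) {
        std::cout << "Failed to open database: " << c.error() << std::endl;
        return 1;
    }
    Query q(c, "select company_name from company");
    MYSQL_ROW row;
    while (q && q.fetch_row(row))
        std::cout << row[0] << std::endl;
    return 0;
}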

All of the management, maintenance and daily use of the database is done through the web interface that we have written using the CGI programs. Initially we didn't have all of the management pages done and we needed to get the first few hundred courses inserted into the database while the development was continuing in parallel. By far the quickest way to whistle up a few data-entry forms was to use one of the Windows-based tools (we chose Access) and to talk to mySQL through its ODBC driver. So we did. Those tools are nice and Linux really needs something as simple and easy to use. At the time of writing we're still not aware of anything as good in the free software domain. There are several projects underway to get there so we are hoping…

As a consequence of using Access we had to insert some fake Primary Key columns into tables that didn't really need them, but they were stripped out once the job was done. We've finished with Access now.

Here's the current set of tables in the database as reported by mySQL:

+--------------------+
| Tables in training |
+--------------------+
| authlist           |
| category           |
| category_links     |
| company            |
| config             |
| course             |
| course_category    |
| course_date        |
| deleted_courses    |
| deliveries         |
| difficulty         |
| duration           |
| enquiries          |
| hotlist            |
| keyword_banners    |
| link_images        |
| link_info          |
| location           |
| random_banners     |
| see_also           |
| sessionkeys        |
| stat_category      |
| stat_course        |
| stat_hotlist       |
| stat_keyword       |
| virtualcats        |
+--------------------+

and a description of the columns in a sample table:

+-------------------+------------+------+-----+---------+----------------+
| Field             | Type       | Null | Key | Default | Extra          |
+-------------------+------------+------+-----+---------+----------------+
| company_name      | char(150)  | YES  |     | NULL    |                |
| contact_address_1 | char(150)  | YES  |     | NULL    |                |
| contact_address_2 | char(150)  | YES  |     | NULL    |                |
| contact_address_3 | char(150)  | YES  |     | NULL    |                |
| contact_postcode  | char(12)   | YES  |     | NULL    |                |
| contact_web       | char(200)  | YES  |     | NULL    |                |
| contact_telephone | char(40)   | YES  |     | NULL    |                |
| contact_fax       | char(40)   | YES  |     | NULL    |                |
| info              | char(255)  | YES  |     | NULL    |                |
| password          | char(10)   | YES  |     | NULL    |                |
| unique_id         | bigint(21) |      | PRI | 0       | auto_increment |
| contact_email     | char(200)  | YES  |     | NULL    |                |
| logo              | char(40)   | YES  |     | NULL    |                |
| contact_name      | char(150)  | YES  |     | NULL    |                |
+-------------------+------------+------+-----+---------+----------------+

We make a lot of use of the auto_increment columns to provide us with guaranteed unique primary keys on tables. None of us could remember enough relational database theory to decide whether this was an approved thing to do, and fortunately, none of us care either. The database generates a number each time a new record is entered into the database, guaranteed to be different from any of the others in the same column. Using this we can always uniquely identify a particular record and it saves us the bother of having to ensure uniqueness elsewhere amongst the data fields.
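
As a concrete illustration (using the C library directly rather than our wrapper, and with a made-up company name), this is roughly how the key that the database has just generated can be read back after an insert:

#include <iostream>
#include <mysql/mysql.h>

// Insert a row and recover the auto_increment key the database generated
// for it, using the C API's mysql_insert_id().
int main()
{
    MYSQL db;
    mysql_init(&db);
    if (!mysql_real_connect(&db, "localhost", "fred", "fredpass",
                            "training", 0, 0, 0)) {
        std::cout << "connect failed: " << mysql_error(&db) << std::endl;
        return 1;
    }
    const char *sql =
        "insert into company (company_name) values ('Example Training Ltd')";
    if (mysql_query(&db, sql) == 0) {
        // unique_id is the auto_increment primary key column of 'company'
        std::cout << "new unique_id is " << mysql_insert_id(&db) << std::endl;
    }
    mysql_close(&db);
    return 0;
}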

Session Management And Tracking

With the template idea proven to work and the database giving no problems at all, we had to get stuck into nitty-gritty design.

We wanted to add a ‘hotlist’ to the site, so that anyone browsing could make a note of courses that looked interesting, then review them later at leisure without having to repeat the searching. Though we call it the hotlist, it's exactly the same as a ‘shopping cart’ on a retail site.

It seems to be a simple feature to add, especially if you don't know how the web works. Most computer systems wouldn't have the slightest problem in relating one mouse-click to another and allowing us to track the items added to the hotlist, then displaying it when it was needed. Unfortunately, it's not simple at all.

The way that web servers work — a feature of the HTTP protocol used — makes it very hard to relate one page request to another. Each comes in anonymously and independently of any other. If you request one page, then click on a link in that page, the new request is unrelated to the first and it's not at all easy to ‘track’ a user of a website. You might think that your internet address (IP address, not email address) could be used to identify you, but that may change with time and if you are using a proxy, your address could be the same as thousands of others who visit. This is a well-known and tricky problem for web site designers. The way that it's handled more-or-less transparently by some commercial web servers in their interface to server-side programming is one of the strong ease-of-use arguments that their vendors will make.

We weren't prepared to trust commercial software though. A while ago we were using a large and well-known book retailer in the UK to order some books when a problem arose in the middle of filling-in the allegedly secure order form. When a colleague went to look at the site to see what might be wrong, from a different PC and using a different web browser, the ‘secure’ details page popped up onto the screen, including the credit card number!! Our guess is that they were using the IP address of the requester to distinguish between sessions… but since we go through a firewall, all of our web browsers appear to have the same address. We decided to hand-craft our solution, since then we would be able to fix it ourselves if anything went wrong.

We took a common approach, using a tactic involving session keys, cookies and modified links on all the pages.

We use a ‘session key’ to track each visitor. Our session keys are just numbers; as long as they are different for each visitor their value is not important. We ensure uniqueness by encoding the time-of-day and the Unix ‘process ID’ of the specific CGI invocation to create them, but the details are not important. Each time a new visitor arrives, a session key is generated for them and used as an index into the database where the session-tracking information is kept. Once we have associated a session key with a session, we ensure it's linked with each subsequent page request by attempting to hand it out in a cookie and also by embedding the session key in every form and page-to-page link that is generated.
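
A sketch of the key-generation idea, and nothing more than the idea; the real encoding differs in detail:

#include <iostream>
#include <string>
#include <sstream>
#include <ctime>
#include <unistd.h>     // getpid()

// Build a session key that is unique enough by combining the time of day
// with the process ID of this CGI invocation.
static std::string new_session_key()
{
    std::ostringstream key;
    key << std::time(0) << getpid();
    return key.str();
}

int main()
{
    std::cout << "session key: " << new_session_key() << std::endl;
    return 0;
}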

Cookies are supported by all the common web browsers. They work a bit like a cloakroom ticket: when you hand in your coat or bag, you get given a ticket. The cloakroom staff don't know who you are, they just know that when you hand back the ticket, they give you the coat or bag back. Cookies are handed out by web sites, then when you request the next page from the site, your browser hands back the appropriate cookie. The web site doesn't know who you are, but it sees the cookie it gave out a while ago, so it can figure out from that (if it's smart enough) how to track your session.

When we generate a new session key, we encode it into a cookie and hand it over to the visitor. If their browser supports cookies we will see that same key coming back in with each subsequent request and can use that to retrieve their session details. It's easy, it's simple: but (between gritted teeth) it's not reliable.
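
In CGI terms the mechanics look roughly like this; the cookie name ‘tpsession’ and the key value are inventions for the example:

#include <cstdlib>
#include <iostream>
#include <string>

// Sketch of handing out and reading back a session cookie from a CGI program.
// Headers are written before the blank line that ends the HTTP header block.
int main()
{
    std::string key;

    // Any cookie the browser sends back arrives in the HTTP_COOKIE variable,
    // e.g. "tpsession=927364512345". Parsing here is deliberately simplistic.
    const char *cookies = std::getenv("HTTP_COOKIE");
    if (cookies) {
        std::string all(cookies);
        std::string::size_type pos = all.find("tpsession=");
        if (pos != std::string::npos)
            key = all.substr(pos + 10, all.find(';', pos) - (pos + 10));
    }

    if (key.empty()) {
        key = "927364512345";   // in reality: a fresh key, as sketched above
        // Hand the new key to the browser; if cookies are enabled we will
        // see it again on every subsequent request.
        std::cout << "Set-Cookie: tpsession=" << key << "; path=/\r\n";
    }

    std::cout << "Content-Type: text/html\r\n\r\n";
    std::cout << "<html><body>Your session is " << key << "</body></html>\n";
    return 0;
}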

If every browser supported cookies, that would be all we would need to do. Unfortunately not all do, and the popular ones allow you to turn them off if you are paranoid about remaining secret. Because of this, we can't rely on cookies being available to us.

Our second method of session tracking is to embed the session key in every link and form that we generate when the server delivers pages to the user's browser. That means that every time a link is followed or a form submitted, we get handed the session key back as part of the URL and we can extract it. Since every page apart from the home page is always the result of clicking a link or submitting a form, we can reliably track sessions in that way.

If the embedded-link approach is guaranteed to work, why should we bother with cookies? The answer is that we want visitors to be able to leave, come back days later, and still find that their hotlist is preserved. That won't work if we only use the links/forms method. Of course, if they don't have cookies, then we can't do it, but the majority of visitors DO have cookies working.

The combination approach gives the widest coverage and works well in practice. One important piece of housekeeping is to discard sessions that are dead and gone, which requires a small amount of care. If someone visits with cookies turned off (or is a new visitor) we have to create a new session key. If they leave our site then come back, we can identify them only if they do present us with a cookie. Each apparent new visit requires a new session key, but many session keys will eventually be lost and become moribund. At present we retain them for about three months, then delete them from the session key table in the database. If someone comes back from the dead, we have to ignore their session key and treat them as new, which may mean, in the case of cookies, telling them to delete their previous cookie. It took a day or so of trying the various combinations before we were happy with that, but now it works fine.

A side-issue is ‘maintenance sessions’. These are when someone logs in to the editing and updating parts of the site. We can't reasonably require every page request to be accompanied by a username and password, so those are requested only on the initial sign-in for maintenance. Maintenance sessions are identified in the database by associating a particular company with the session. Each time activity happens on the session — a page is requested or a change made — a timer is reset. If the timer reaches a fixed limit (which is kept short), then the session has the link to the company broken and reverts to being an ordinary session. This helps to avoid the situation where someone is sharing a PC, logs in to do maintenance, then leaves, but someone comes along later, finds the maintenance pages in their history list and clicks back into the session. Again this is working well and doesn't require brand-new sessions for maintenance.
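
The timeout logic is tiny. This sketch uses illustrative table and column names (the real schema differs in detail):

#include <ctime>
#include <iostream>

// Keep the limit short: 15 minutes of inactivity ends maintenance rights.
const long MAINTENANCE_LIMIT = 15 * 60;

// Returns true if the session should keep its maintenance rights.
static bool maintenance_still_valid(std::time_t last_active)
{
    return std::time(0) - last_active <= MAINTENANCE_LIMIT;
}

int main()
{
    std::time_t last_active = std::time(0) - 20 * 60;   // 20 minutes ago
    if (!maintenance_still_valid(last_active)) {
        // In the real code this is an UPDATE that breaks the session's link
        // to the company, reverting it to an ordinary session.
        std::cout << "update sessionkeys set company_id = NULL where ...\n";
    }
    return 0;
}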

The code has been carefully structured so that most of the general housekeeping is independent of the particular application logic — well, we say carefully structured, but we keep finding ways of improving the generic parts and reducing their dependence on the application-specific components. This process will probably continue for ever.

Probably the least successful tactic was to be sceptical of the GNU C++ compiler's exception handling. We've worked hard to check every possible error condition and to log it to a debug file. We have probably missed some, but here's a typical (and genuine) chunk of the C++ code.

if(category_id == String::NULLSTRING){
  error("Failed to find non-null category id in current session"
    + indexsession + " in get_cats()",
      "internal", Mytemplate);
  return 0;
}
if (!Myform.contains("sitemap")) {
  String s(time(0));
  Query mq(Con, "insert into stat_category (time, category_id) values ("
    + quote(s.cstring()) + ", " + quote(category_id)
    + ")");
  if (!mq){
    Log::fail("Failed to update stat_category table in get_cats(),"
      + " reason is: " + mq.error());
    // can continue, this is not fatal, we just miss a click
  }
}

The important points to note from the code above are that (a) we have to provide a reasonable error handler for every conceivable problem, and (b) by our own estimate somewhere between 50% and 60% of the overall code exists to deal with error conditions. This is a fact of life in serious software development. That's how it works.

A much better tactic is to make use of exception handling if there is support for it in your development language. You can write most of the code on the presumption that things just work ‘right’ and, provided you have written the lower layers of the code to throw appropriate exceptions rather than testing at every function call, you can surround large chunks of logic with their own try{...}catch(){...} sequences. In retrospect this would have saved us a lot of coding, shrunk the size of the programs by at least a quarter and probably have speeded up the development by more than that. The reason we didn't use it was sheer prejudice. C++ compilers have not traditionally been very good at implementing exception handling — we still don't know if GNU C++ is good or not, since we haven't used that part extensively — and it introduced yet another unknown at a time when we didn't want one.
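
To show the shape of what we mean, here is a short sketch of that style; ‘QueryError’ and ‘run_query’ are inventions for the example rather than anything in our code:

#include <iostream>
#include <stdexcept>
#include <string>

// The lower layers throw; the application logic stays clean.
class QueryError : public std::runtime_error {
public:
    explicit QueryError(const std::string &msg) : std::runtime_error(msg) {}
};

static void run_query(const std::string &sql)
{
    bool failed = true;                     // pretend the database said no
    if (failed)
        throw QueryError("query failed: " + sql);
    // ... otherwise just carry on as if everything worked ...
}

int main()
{
    try {
        run_query("insert into stat_category (time, category_id) values (...)");
        run_query("select * from category");
        // ... large chunks of logic can sit here without per-call checks ...
    } catch (const QueryError &e) {
        std::cerr << "database problem: " << e.what() << std::endl;
        return 1;
    }
    return 0;
}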

If we do any rewriting, it'll almost certainly start by introducing exception handling as the general mechanism for dealing with errors. We'll check that it's reliable, then spread it throughout the code. A shame that C++ doesn't have Java's compile-time facility to tell you about uncaught exceptions, because that would make the migration process much simpler. We'll have to do it by eyeball when we do.

Site Tricks And Tactics

Building a site isn't much good if nobody uses it. A database is useless if it's empty. We had to get around this to make the site of any use at all, so the first tactic was to pre-populate it from brochures we could find from some of the major training companies. We weren't stealing their copyright and if any objected we would remove them from the site, but in fact none did. Seeding the database took a while, but it meant that when the database went live it had useful information in it from the beginning. Once seeded and live, the entries in the database are maintained by the training providers themselves and we do very limited maintenance on it. We do insist on controlling the way that the categories are organised and sometimes rearrange them through a handful of maintenance screens that are only usable from inside our firewall.

We decided that, like the various commercial directories, a basic listing should be free, with options to pay for enhanced features. This has certainly proved to be popular. We seeded the database with about 20 companies and 500 or so course listings, and as of today (3rd June 1999) it has just rolled over 3,000 courses listed and 265 training providers. The add-on features are the ability to sort to the top of listings, the association of a company logo with the course details, and extras such as keyword rental and banner rental within categories. So far we aren't pushing those commercial opportunities, because we would rather build traffic through the site and then charge a realistic figure than start low and rack the rates up.

To be fair to all of the providers listed, we randomise the presentation of courses within the listings. Each session has its own random number generator and this is used to select the order of the listings. The effect is that you will see the same ordering each time from within a single session, but other users with other sessions see a different order. Courses with the list-to-the-top flag set have a large positive bias added to them so that they will always come above courses without the flag, but if two providers both have list-to-top then *their* courses are arranged at the top in a similar randomised fashion.
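
A sketch of the ordering logic, with made-up course names and an arbitrary bias value, looks like this:

#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>

// Seed the random number generator from the session key so one visitor
// always sees the same order, and give list-to-the-top courses a large
// positive bias so they always sort above the rest.
struct Listing {
    std::string course;
    bool list_to_top;
    int sort_score;
};

static bool by_score_desc(const Listing &a, const Listing &b)
{
    return a.sort_score > b.sort_score;
}

int main()
{
    unsigned long session_key = 927364512;   // illustrative session key
    std::srand(session_key);                 // per-session, repeatable ordering

    std::vector<Listing> listings;
    const char *names[] = { "Intro to Perl", "Advanced C", "Linux Admin" };
    bool top[] = { false, true, false };
    for (int i = 0; i < 3; ++i) {
        Listing l;
        l.course = names[i];
        l.list_to_top = top[i];
        l.sort_score = std::rand() % 1000 + (l.list_to_top ? 100000 : 0);
        listings.push_back(l);
    }

    std::sort(listings.begin(), listings.end(), by_score_desc);
    for (std::vector<Listing>::size_type i = 0; i < listings.size(); ++i)
        std::cout << listings[i].course << "\n";
    return 0;
}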

Getting a database like this listed with the search engines sounds like a good idea — but it's problematic. You want them to point to you so that the public can find you, but you do NOT want them to be sending their robot link-followers crawling all over. The robots are programmed to follow every link on a site, which would mean that they would endlessly be causing sessions to be generated, adding courses to their (huge) hotlists and so on. And, as a consequence, there would be some very strange listings showing as search results. This is decidedly unwelcome.

Instead, we created a robots.txt file (the standard thing to do) which bars them from using any of the dynamic parts of the site. We set up a nightly task to run which queries the categories and the courses in the database and which generates a whole sheaf of static HTML pages — a catalogue of the database contents — each of which links into the live part of the site. We point the search engines at those pages instead and have a secret hidden link from the home page which also points to them. This approach has proved to be very successful at getting the site indexed by the major robotic search engines and we'd certainly use it again for a similar site.

The next tactic to drive up the traffic was to look at our own categories, then go hunting for specialist websites that contain pointers to useful resources. Our goal was to get them to point to us and say ‘if you want training in poodle-grooming, click here’. We didn't want those links to point just to the home page, so we created a special entry-point direct to the appropriate category. A wrinkle that gives us more flexibility: rather than pointing directly to our category for poodle-grooming (sorry, we don't really have that one yet), such a link points to a ‘virtual category’ which can be set to point to whichever of the current categories is most appropriate. The virtual category means that ‘poodle grooming’ could point to pet-care in the early days, then show-dogs when that category was worth setting up as a separate entity, then finally poodle-grooming if we ever got that far. The other site, the one that links to us, wouldn't have to change. We keep an eye on our referer-log and it's clear that this is an important way of building up traffic to the categories. An example of a virtual link is this one to Linux.
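
Reduced to a sketch (with invented names and the database table replaced by an in-memory map), the virtual-category lookup is nothing more than this:

#include <iostream>
#include <map>
#include <string>

// An external site links to a stable virtual name; we map it to whichever
// real category currently fits. The mapping lives in the 'virtualcats'
// table; here it is just a map with illustrative values.
int main()
{
    std::map<std::string, std::string> virtualcats;
    virtualcats["linux"] = "operating-systems";
    virtualcats["poodle-grooming"] = "pet-care";    // could later point elsewhere

    std::string requested = "poodle-grooming";
    std::map<std::string, std::string>::const_iterator it =
        virtualcats.find(requested);
    if (it != virtualcats.end())
        std::cout << "redirect visitor to category: " << it->second << "\n";
    else
        std::cout << "unknown virtual category, show the home page\n";
    return 0;
}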

Something we haven't seen much of yet but we are sure will become a must-have is our statistics collecting. Every click to the site is logged, along with the time and date, what category, what course etc. etc. Keyword searches are captured… you name it, we probably capture it. The stats tables in the database are not small. These allow us to do lots of data-mining. We can rank the most popular courses against others in their category, we can rank the categories, we can see preferences moving over time, we can dump the data to spreadsheets for graphical analysis and we even generate our own graphs to show the public (free) and the course providers (not free) where the market is going and what is proving to be popular.

We don't know of anyone else at the moment who can offer live consolidated market research of this nature. That doesn't mean it's not available (just that we don't know of it), but it's currently not a feature of what most websites can do. The truth is that it's incredibly easy to add. The statistics gathering code is a tiny fraction of what we had to do. The graphing may not be beautiful, but we simply picked the Perl modules to do it and off we went — this is the bit of the site that we did using Perl.

Overall Experience

We set out to do it. In about three months of effort the bulk of the site was up and running. There's less than six staff months of work in the whole thing, yet we think it stacks up better than 99% of all the websites we have ever seen. It's generating increasing levels of traffic, regular training leads for us and the other providers who are listed and on a day-to-day basis we can just leave it running, mostly looking after itself.

We think it will prove to be an example of next-generation capability as we roll on into the year 2000. There is a small number of sites of equivalent function, but we bet they spent a LOT more than we did!

Ours is built totally from free software and is extremely reliable. At this moment, the server (which hosts a dozen other commercial websites) is showing an uptime of 183 days and 18 hours; its most recent restart was due to a fault at our leased-line providers' end which meant we had to swap a card over to prove the fault wasn't at our end. Prior to that it had never crashed or needed to be rebooted since we installed it 9 months ago. Rough tests show that with no optimisation at all in our code it can probably service 2–3 requests per second — vastly more than our bandwidth could ever support. It's running on an AMD K266 processor with 64MB of RAM. It says it has a load average of 0.01 — i.e. it's 99% idle.

If we were to rewrite what we have done, we would switch to C++ exception handling — a big win, we expect — and also replace our tag language with PHP, which is vastly more powerful, widely used and well supported. But what we have works well enough; we're just perfectionists when it comes to software.

We have to say we think it's a success!