Date: Sat, 15 Feb 1997 15:23:18 -0500 (EST)
From: John Fieber
To: doc@freebsd.org
cc: "Jordan K. Hubbard"
Subject: Re: cvs commit: www/data security.sgml
In-Reply-To: <11604.856026076@time.cdrom.com>

I suggested to Jordan that his recent "FreeBSD" security document, which was added to the web server as an HTML page, should be written up as a tutorial using Docbook. He responds with:

> Is docbook something I want to employ *only* for tutorials and
> handbook pieces (when the handbook goes docbook, that is), and if this
> had stayed up where auditors.sgml is, would the format in which I
> originally wrote it be considered "correct?" This HTML-wrapped-in-SGML
> stuff we have up at the top level of the web page hierarchy still sort
> of confuses me, I'll admit, and I'm not sure to what degree it's wise or
> allowable to use more specialized SGML features in there, or even what
> "styles" are appropriate for which areas of the doc tree at the
> moment. Maybe a short paragraph, sent to www for the benefit of
> everyone, during these transitionary times? :-)

Since the web pages have recently become a much more communal project than they were, some explanation is definitely in order!

Two versions are supplied. Pick the one that matches your attention span.
Those opting out of the long version would be advised to skip to the end for the summary, however. :)

Short version:

Any document that will only see the light of day in the context of a (the) web server is best written in HTML. Any document that might have a useful life in any context outside the web server should probably not be written in HTML. The viable alternatives are linuxdoc (discouraged) and docbook (encouraged). The primary role of documents written directly in HTML is to provide an access path to these documents.

Long version:

An HTML file is the standard web transaction unit. For a variety of reasons, writing and maintaining documents in transaction unit chunks is a chore unless the conceptual content fits neatly in one, or a very small handful of, transactional units.

Despite claims to the contrary, and its original intent, HTML is primarily a layout markup language with hypertext features. Maintaining layout consistency with visual markup languages is a chore, particularly if you don't have any "macro" facility to bundle up frequently used layout constructs under a single name.

Layout markup languages limit the degree to which documents can *usefully* be transported into other contexts, e.g. print. As an example, a single, but large, conceptual document may be broken into many transaction units. The transaction units will contain two distinct categories of links: hypertext links serving as cross references, and navigational links that parallel the document's structure. The latter are instantiated in very different ways in print versus on-line documents, but if you use HTML, the structure links are hardwired and cannot reliably be distinguished from hypertext links. Going from this to print is a hassle, and can be completely avoided by choosing a master format more suitable to the document, and generating derivatives from that.

On the other hand, the spirit of hypertext is all about doing things that cannot be done in print, so what is the problem?
Why bother with these legacy "linear" document structures? It turns out that humans simply do not deal well with looking at a single transaction unit if they do not have some mental construction of its structural context. So, while we can take full advantage of hypertext links between two points in an on-line document, it is critical to provide cues not only about where a link goes before the user follows it, but also about the intervening "space".

By adopting familiar document structures, we enable the reader to use their highly tuned schemata for dealing with information. Pick up a novel, a technical manual, a newspaper, a research journal, a magazine and you immediately and unconsciously engage reading strategies that are highly optimized for each. The structure of ad-hoc hypertext can never provide the user with such cues. HTML, by focusing the authoring on the *transactional* unit rather than the *conceptual* unit, makes creating coherent, familiar structures out of numerous transactional units unnecessarily difficult.

Now, with that as background, consider the content of the FreeBSD web site and documentation. The bulk of the site is technical documentation in a traditional sense and *should* leverage the familiar technical manual structure. Since this is hard to do well directly in HTML, I suggest that any documentation be written in something more suitable such as docbook or linuxdoc. This lets the author focus on the conceptual structure of the document, and leaves the transactional structure to a mechanical process.

The role of authoring in HTML is essentially to provide a route *to* the documents; a directory service of sorts (but with the obligatory web glitz). Basically, if there is substantial content and you are doing it directly in HTML rather than generating HTML from a more suitable format, something is wrong.
:)

Dropping down to technical issues, the HTML source that makes up the web pages leverages SGML features that are not supported by current web browsers. Basically, this involves defining entities for "boilerplate" text which can be included in many HTML documents. Currently various stylistic elements such as standard page headers, footers, colors, graphics and the like are defined in one location and included in HTML documents using entities. Since web browsers cannot deal with arbitrary entities, the build process runs each file through an SGML normalizer which resolves all entity references (and validates the markup in the process). You can think of it as operating like the C preprocessor. The end product is HTML that fully conforms to whatever HTML spec we choose (currently HTML 3.2) and that any respectable browser should render in a useful fashion.

For a more rambling description with examples, look at:

http://fallout.campusview.indiana.edu/~jfieber/sum1996/l577

About the state of linuxdoc and docbook.

The linuxdoc DTD was derived from a DTD called QWERTZ which is essentially an SGMLized LaTeX. Most of the tags are directly derived from their corresponding LaTeX control sequences. While linuxdoc allows reasonably graceful handling of large documents (e.g. the FreeBSD handbook), its typesetting heritage shows through strongly in the abundance of visual markup tags and the lack of descriptive markup beyond ultra-general things such as sectioning tags (chapter, sect, etc.). Also, the DTD is poorly implemented. It makes heavy use of tag minimization and short references, which I have observed to be the source of many markup errors. What linuxdoc has going for it (currently) is easy generation of serviceable HTML and groff (and consequently postscript, text and, in theory, PCL and DVI).

The Docbook DTD, on the other hand, was designed explicitly for software documentation. It offers rich options for descriptive markup--so rich as to be potentially overwhelming at times.
The implementation of the DTD is truly exemplary, and it is actively supported by the likes of Fujitsu, Microsoft, DEC, Sun, SCO, O'Reilly, ArborText and SoftQuad. The downside is that generation of derivative formats is not as far along as it is for linuxdoc. Decent HTML generation is possible for relatively small documents only, as I have yet to decide on how best to break up large documents into transaction units. The method used for linuxdoc works, but is fairly crude and I'd like to do a better job with Docbook. Docbook to groff is non-existent at the moment, although usable RTF can be generated using a DSSSL style sheet.

As for specific FreeBSD documents, semi-automatic linuxdoc to docbook conversion is a current reality, although I have not put it in FreeBSD-current (and don't plan to). What is holding up converting the handbook and FAQ is simply the lack of groff support and of a good way to break up large documents. However, I don't think either of these should stand in the way of creating smaller documents using docbook. Changes to the handbook should be made to the existing copy, but I would like to see completely new additions written as tutorials to be integrated at a later date. I am also considering splitting the handbook into a couple of smaller volumes.

In summary, our web services are composed of:

Documentation: FAQ, Handbook, tutorials authored in docbook and/or linuxdoc, converted to HTML. This should shortly be expanded to include the roff documents in the doc tree, converted to HTML.

Database: Mailing list archives, web site searching, GNATS, cvs repository, ports collection. This is all more or less automatically generated from non-HTML sources, as the documentation is.

HTML: Provides small bits of content. Mostly serves as an access mechanism to the Documentation and Database services.

-john
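
Illustration of the entity mechanism described above (boilerplate defined once and resolved before a page reaches the browser). This is only a rough Python sketch of the idea, not the SGML normalizer used in the build: the entity names, their contents and the `resolve_entities` helper are all made up for the example, and unlike a real normalizer it does no markup validation.

```python
import re

# Hypothetical boilerplate entities, defined once for the whole site.
ENTITIES = {
    "header": "<h1>FreeBSD</h1>",
    "footer": "<hr>&copyright;",
    "copyright": "<address>Copyright notice here</address>",
}

# Matches references of the form &name;
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")


def resolve_entities(text, entities, depth=0):
    """Replace &name; references with their definitions, recursively."""
    if depth > 10:
        raise ValueError("entity definitions nest too deeply (possible cycle)")

    def expand(match):
        name = match.group(1)
        if name not in entities:
            # Leave references we do not define (e.g. &amp;) untouched.
            return match.group(0)
        # A definition may itself contain references, so resolve it too.
        return resolve_entities(entities[name], entities, depth + 1)

    return ENTITY_REF.sub(expand, text)


page = "<html><body>&header;<p>Tom &amp; Jerry</p>&footer;</body></html>"
print(resolve_entities(page, ENTITIES))
```

As with the C preprocessor, the page author writes only the short reference and the expansion happens mechanically at build time, so changing the footer in one place changes it on every page.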