From owner-freebsd-doc  Sat Feb 15 12:23:35 1997
Return-Path: <owner-doc>
Received: (from root@localhost)
          by freefall.freebsd.org (8.8.5/8.8.5) id MAA11863
          for doc-outgoing; Sat, 15 Feb 1997 12:23:35 -0800 (PST)
Received: from fallout.campusview.indiana.edu (fallout.campusview.indiana.edu [149.159.1.1])
          by freefall.freebsd.org (8.8.5/8.8.5) with ESMTP id MAA11847
          for <doc@freebsd.org>; Sat, 15 Feb 1997 12:23:19 -0800 (PST)
Received: from localhost (jfieber@localhost)
	by fallout.campusview.indiana.edu (8.8.5/8.8.5) with SMTP id PAA19290;
	Sat, 15 Feb 1997 15:23:18 -0500 (EST)
Date: Sat, 15 Feb 1997 15:23:18 -0500 (EST)
From: John Fieber <jfieber@indiana.edu>
To: doc@freebsd.org
cc: "Jordan K. Hubbard" <jkh@time.cdrom.com>
Subject: Re: cvs commit: www/data security.sgml 
In-Reply-To: <11604.856026076@time.cdrom.com>
Message-ID: <Pine.BSF.3.95q.970215130334.14200E-100000@fallout.campusview.indiana.edu>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-doc@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

I suggested to Jordan that his recent "FreeBSD" security
document, which was added to the web server as an HTML page,
should be written up as a tutorial using Docbook.

He responds with:

> Is docbook something I want to employ *only* for tutorials and
> handbook pieces (when the handbook goes docbook, that is), and if this
> had stayed up where auditors.sgml is, would the format in which I
> originally wrote it be considered "correct?"  This HTML-wrapped-in-SGML
> stuff we have up at the top level of the web page hierarchy still sort
> of confuses me, I'll admit, and I'm not sure to what degree it's wise or
> allowable to use more specialized SGML features in there, or even what
> "styles" are appropriate for which areas of the doc tree at the
> moment.  Maybe a short paragraph, sent to www for the benefit of
> everyone, during these transitionary times? :-)

Since the web pages have recently become a much more communal project
than they were, some explanation is definitely in order!  Two versions
are supplied.  Pick the one that matches your attention span.
Those opting out of the long version would be advised to skip to
the end for the summary however. :)

Short version:

Any document that will only see the light of day in the context
of a (the) web server is best written in HTML.  Any document that
might have a useful life in any context outside the web server
should probably not be written in HTML.  The viable alternatives
are linuxdoc (discouraged) and docbook (encouraged).  The primary role
of documents written directly in HTML is to provide an access path to
these documents.

Long version:

An HTML file is the standard web transaction unit.

For a variety of reasons, writing and maintaining documents in
transaction unit chunks is a chore unless the conceptual content
fits neatly in one, or a very small handful of, transaction units.

Despite claims to the contrary, and its original intent, HTML is
primarily a layout markup language with hypertext features.
Maintaining layout consistency with visual markup languages is a
chore, particularly if you don't have any "macro" facility to
bundle up frequently used layout constructs under a single name.

Layout markup languages limit the degree to which documents can
*usefully* be transported into other contexts, e.g. print. 

As an example, a single, but large, conceptual document may be broken
into many transaction units.  The transaction units will contain two
distinct categories of links: hypertext links serving as cross
references, and navigational links that parallel the document's
structure.  The latter are instantiated in very different ways in
print versus on-line documents, but if you use HTML, the structure
links are hardwired and indistinguishable (reliably) from hypertext
links.  Going from this to print is a hassle, and can be completely
avoided by choosing a master format more suitable to the document, and
generating derivatives from that.
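
To make the distinction concrete, here is a minimal sketch in Python.
It is not part of any existing tool: the SECTIONS data and both function
names are invented for illustration.  The master copy records only the
section order and the explicit cross references.  The navigational
links are never stored; each derivative generates them in whatever form
suits its medium.

```python
# Hypothetical master document: ordered sections plus explicit cross
# references. Nothing here says "Prev" or "Next"; that is derived.
SECTIONS = [
    {"id": "install", "title": "Installing",
     "body": "Boot the install floppy.", "xrefs": ["network"]},
    {"id": "kernel", "title": "Kernel",
     "body": "Build a custom kernel.", "xrefs": []},
    {"id": "network", "title": "Networking",
     "body": "Configure the interfaces.", "xrefs": ["install"]},
]


def to_html_pages(sections):
    """One page per section; Prev/Next links come from the ordering."""
    titles = {s["id"]: s["title"] for s in sections}
    pages = {}
    for i, s in enumerate(sections):
        lines = ["<h1>%s</h1>" % s["title"], "<p>%s</p>" % s["body"]]
        # Cross references: real hypertext links chosen by the author.
        xrefs = ['<a href="%s.html">%s</a>' % (x, titles[x])
                 for x in s["xrefs"]]
        if xrefs:
            lines.append("<p>See also: %s</p>" % ", ".join(xrefs))
        # Navigational links: generated mechanically from the structure.
        nav = []
        if i > 0:
            nav.append('<a href="%s.html">Prev</a>' % sections[i - 1]["id"])
        if i + 1 < len(sections):
            nav.append('<a href="%s.html">Next</a>' % sections[i + 1]["id"])
        lines.append("<p>%s</p>" % " | ".join(nav))
        pages[s["id"] + ".html"] = "\n".join(lines)
    return pages


def to_print(sections):
    """One linear text; structure becomes numbering, xrefs become
    section numbers, and no navigational links appear at all."""
    numbers = {s["id"]: i + 1 for i, s in enumerate(sections)}
    out = []
    for s in sections:
        out.append("%d. %s" % (numbers[s["id"]], s["title"]))
        out.append(s["body"])
        for x in s["xrefs"]:
            out.append("(see section %d)" % numbers[x])
    return "\n".join(out)


pages = to_html_pages(SECTIONS)
printed = to_print(SECTIONS)
```

Reordering SECTIONS changes every Prev/Next link and every section
number without anyone touching the content, which is exactly what is
hard to do when those links are typed into the HTML by hand.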

On the other hand, the spirit of hypertext is all about doing
things that cannot be done in print, so what is the problem?  Why
bother with these legacy "linear" document structures?

It turns out that humans simply do not deal well with looking at a
single transaction unit if they do not have some mental construction of
its structural context.  So, while we can take full advantage of
hypertext links between two points in an on-line document, it is
critical to provide cues not only about where a link goes before the
user follows it, but also about the intervening "space".  By
adopting familiar document structures, we enable the reader to use
their highly tuned schemata for dealing with information.  Pick up a
novel, a technical manual, a newspaper, a research journal, a
magazine and you immediately and unconsciously engage reading
strategies that are highly optimized for each.  The structure of
ad-hoc hypertext can never provide the user with such cues.

HTML, by focusing the authoring on the *transactional* unit rather
than the *conceptual* unit, makes creating coherent, familiar
structures out of numerous transactional units unnecessarily
difficult.

Now, with that as a background, consider the content of the FreeBSD
web site and documentation.  The bulk of the site is technical
documentation in a traditional sense and *should* leverage the
familiar technical manual structure.  Since this is hard to do well
directly in HTML, I suggest that any documentation be written in
something more suitable such as docbook or linuxdoc.  This lets the
author focus on the conceptual structure of the document, and leave
the transactional structure to a mechanical process.

The role of authoring in HTML is essentially to provide a route
*to* the documents; a directory service of sorts (but with the
obligatory web glitz).  Basically, if there is substantial
content and you are doing it directly in HTML rather than
generating HTML from a more suitable format, something is wrong. 
:)

Dropping down to technical issues, the HTML source that makes up the
web pages leverages SGML features that are not supported by current
web browsers.  Basically, this involves defining entities for "boilerplate"
text which can be included in many HTML documents.  Currently various
stylistic elements such as standard page headers, footers, colors,
graphics and the like are defined in one location and included in HTML
documents using entities.

Since web browsers cannot deal with arbitrary entities, the build
process runs each file through an SGML normalizer which resolves all
entity references (and validates the markup in the process).  You can
think of it as operating like the C preprocessor.  The end product is
HTML that fully conforms to whatever HTML spec we choose (currently
HTML 3.2) that any respectable browser should render in a useful
fashion. For a more rambling description with examples, look at:

  http://fallout.campusview.indiana.edu/~jfieber/sum1996/l577
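
As a rough illustration of the preprocessor analogy, the entity
resolution step can be sketched in a few lines of Python.  This is not
the normalizer the build actually uses, and it does no validation; the
entity names and values in ENTITIES are made up for the example.

```python
import re

# Hypothetical boilerplate entities, defined once for the whole site.
# These names and values are illustrative, not the real definitions.
ENTITIES = {
    "header": "<h1>FreeBSD</h1>",
    "footer": "<hr><address>&webmaster;</address>",
    "webmaster": "www@example.org",
}

# Entities that browsers already understand are passed through as-is.
BUILTIN = {"amp", "lt", "gt", "quot"}

ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")


def expand(text, entities, depth=0):
    """Replace each &name; reference with its definition, recursively."""
    if depth > 20:
        raise ValueError("entity definitions appear to be recursive")

    def substitute(match):
        name = match.group(1)
        if name in BUILTIN:
            return match.group(0)
        if name not in entities:
            raise KeyError("undefined entity: " + name)
        # A definition may itself refer to other entities.
        return expand(entities[name], entities, depth + 1)

    return ENTITY_REF.sub(substitute, text)


page = "<html><body>&header;<p>Fish &amp; chips</p>&footer;</body></html>"
result = expand(page, ENTITIES)
```

The source page mentions only &header; and &footer;, so changing the
footer in the one place it is defined changes it on every page at the
next build.  An undefined entity stops the build rather than leaking
through to the browser.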

About the state of linuxdoc and docbook.  The linuxdoc DTD was
derived from a DTD called QWERTZ which is essentially an SGMLized
LaTeX.  Most of the tags are directly derived from their
corresponding LaTeX control sequences. While linuxdoc allows
reasonably graceful handling of large documents (e.g. the FreeBSD
handbook), its typesetting heritage shows through strongly in the
abundance of visual markup tags and the lack of descriptive markup
beyond ultra-general things such as sectioning tags (chapter, sect,
etc.).  Also, the DTD is poorly implemented.  It makes heavy use of
tag minimization and short references which I have observed as the
source of many markup errors.

What linuxdoc has going for it (currently) is easy generation of
serviceable HTML, and groff (and consequently postscript, text and in
theory, PCL and DVI).

The Docbook DTD, on the other hand, was designed explicitly for
software documentation.  It offers rich options for descriptive
markup--so rich as to be potentially overwhelming at times.  The
implementation of the DTD is truly exemplary, and it is actively
supported by the likes of Fujitsu, Microsoft, DEC, Sun, SCO, O'Reilly,
ArborText and SoftQuad.

The downside is that generation of derivative formats is not as far
along as it is for linuxdoc.  Decent HTML generation is possible for relatively
small documents only, as I have yet to decide on how best to break up
large documents into transaction units.  The method used for linuxdoc
works, but is fairly crude and I'd like to do a better job with
Docbook.  Docbook to groff is nonexistent at the moment, although
usable RTF can be generated using a DSSSL style sheet.


As for specific FreeBSD documents, semi-automatic linuxdoc to docbook
conversion is a current reality, although I have not (and don't plan
to) put it in FreeBSD-current.  What is holding up converting the
handbook and FAQ is simply the lack of groff support and breaking up
large documents.  However, I don't think either of these should stand
in the way of creating smaller documents using docbook.   Changes to
the handbook should be made to the existing copy, but I would like to
see completely new additions written as tutorials to be integrated at
a later date.

I am also considering splitting the handbook into a couple of smaller
volumes.


In summary, our web services are composed of:

Documentation: 
    FAQ, Handbook, tutorials authored in docbook and/or linuxdoc,
    converted to HTML.  This should shortly be expanded to include
    the roff documents in the doc tree, converted to HTML.

Database:
    Mailing list archives, web site searching, GNATS, cvs repository,
    ports collection.  This is all more or less automatically generated from
    non-HTML sources, as the documentation is.

HTML:
    Provides small bits of content.  Mostly serves as an access
    mechanism to the Documentation and Database services.


-john