From owner-freebsd-ports Tue Aug 3 12:45:55 1999
Delivered-To: freebsd-ports@freebsd.org
Received: from att.com (kcgw1.att.com [192.128.133.151])
    by hub.freebsd.org (Postfix) with SMTP id 9E0401530F;
    Tue, 3 Aug 1999 12:45:51 -0700 (PDT)
    (envelope-from shalunov@att.com)
Received: from kcig1.att.att.com by kcgw1.att.com (AT&T/IPNS/UPAS-1.0)
    for freebsd.org!jfieber freebsd.org!freebsd-ports
    sender att.com!shalunov (att.com!shalunov); Tue Aug 3 14:45 CDT 1999
Received: from tuzik.lz.att.com (tuzik.lz.att.com [135.25.200.84])
    by kcig1.att.att.com (AT&T/IPNS/GW-1.0) with ESMTP id OAA08796;
    Tue, 3 Aug 1999 14:45:29 -0500 (CDT)
Received: (from shalunov@localhost)
    by tuzik.lz.att.com (8.9.2/8.9.2) id PAA11536;
    Tue, 3 Aug 1999 15:48:03 -0400 (EDT)
    (envelope-from shalunov@att.com)
Date: Tue, 3 Aug 1999 15:48:03 -0400 (EDT)
Message-Id: <199908031948.PAA11536@tuzik.lz.att.com>
From: stanislav shalunov
To: jfieber@FreeBSD.org
Cc: freebsd-ports@FreeBSD.org
Subject: sgmlfmt: producing text files
Sender: owner-freebsd-ports@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

John,

I'm using sgmlfmt (Id: sgmlfmt.pl,v 1.26 1997/05/12 14:16:48 jfieber
Exp, the version that came with 3.1-RELEASE) to format my SGML linuxdoc
documents. I need to be able to produce plain text output.

I noticed that "sgmlfmt -f ascii" goes through groff to produce the
text. Unfortunately, this means that the file is formatted well enough
for printing on a line printer, but that's the least likely use of a
text file (it would rather be used for Usenet postings, emailing, etc.;
if I wanted to print, I'd produce PostScript!).

The disadvantages of using groff to produce text are:

* Underlined/bold text (easily fixed with "ul -l dumb");
* Headings/footings (not so easily fixed, because one needs to extract
  the title, decode entities, etc.);
* Hyphenations: these make the text unsearchable, and spell-checking
  won't work. The accepted practice is to just wrap the lines on word
  boundaries.
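
To make the first and third points concrete, here is a minimal sketch in Python. It is not part of sgmlfmt. It assumes the usual groff tty convention of encoding bold as "c BACKSPACE c" and underline as "_ BACKSPACE c", and it re-wraps a paragraph on whitespace only, never splitting a word. The function names strip_overstrike and wrap_paragraph are made up for this illustration, and the 72-column default is an arbitrary choice.

```python
import re
import textwrap

# Assumed convention: groff's tty output marks bold as "c\bc" and
# underline as "_\bc", so every "any char + backspace" pair is markup.
_OVERSTRIKE = re.compile(r".\x08")


def strip_overstrike(text: str) -> str:
    """Remove 'char + backspace' pairs, keeping the character printed last."""
    previous = None
    # Repeat until stable so that stacked backspaces are also consumed.
    while previous != text:
        previous = text
        text = _OVERSTRIKE.sub("", text)
    return text


def wrap_paragraph(text: str, width: int = 72) -> str:
    """Re-wrap one paragraph on whitespace only, never splitting a word."""
    return textwrap.fill(
        text,
        width=width,
        break_long_words=False,   # an over-long word stays whole on its own line
        break_on_hyphens=False,   # do not break inside words like "spell-checking"
    )
```

For example, wrap_paragraph(strip_overstrike(raw_paragraph)) gives searchable, spell-checkable text for a single paragraph. It does not address the second point: page headings and footings would still have to be removed separately.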

I found that I can get much better results by editing sgmlfmt so that
$maxlevel=0, producing an HTML file, and then running
"lynx -dump -nolist". I also found that for moderate-size documents, I
*don't* want to have them split into multiple files, so maxlevel 0
seems a very reasonable default for HTML generation as well.

In short:

Suggestion one: Produce text files from HTML (using
"lynx -dump -nolist").

Suggestion two: Make $maxlevel configurable from the command line (I
think latex2html uses the option name "-split" for the variable with
this meaning, just in case you want to be consistent with something).

Bug report: When making HTML from a linuxdoc file that has tag in ,
email is not handled correctly.

Question: What about producing LaTeX output? I would very much prefer
LaTeX formatting to the shitty paragraphs produced by groff (no offense
to groff, but TeX's paragraph formatting and hyphenation algorithms are
just much better).

Seems like the script was last updated a lo-o-ong time ago; do you
still support it?

--Stanislav