From: Chuck Swiger
Date: Sun, 22 Feb 2004 14:26:00 -0500
To: Thierry Thomas
Cc: Ceri Davies, freebsd-doc@FreeBSD.org
Subject: Re: Validating docbook articles...
Message-ID: <40390248.1060104@pkix.net>
In-Reply-To: <20040222181114.GB32524@graf.pompo.net>

Thierry Thomas wrote:

> On Mon 16 Feb 04 at 22:29:46 +0100, Chuck Swiger wrote:
>
>> ...tidy-devel doesn't understand the -preserve option. Something like the
>> following, as www/tidy-devel/files/patch-console-tidy.c:
>
> Some days ago, we were speaking of this option (with Alex Dupre). It
> seems useful for documents encoded with charsets unsupported by Tidy.

Hi, Thierry--

Thanks for your response and interest in the change I suggested. I would be happy to spend more time on this issue and "do the right thing" rather than just turn that option into a null operation. However, as you've noticed:

> There exist two possibilities:
>
> - we encode all documents in supported charsets (e.g. UTF8), and this
> option is not necessary (we can apply your patch to keep a compatibility
> with old scripts);
>
> - we have documents written in such encodings, and tidy-devel should be
> patched to actually preserve entities, or we have to keep the original
> Tidy.

Your latter comment suggests that the -preserve functionality in tidy is no longer available in tidy-devel, which matches my own experience: I looked through the tidy-devel code for a comparable flag to set and found nothing. Maybe we should ask the author, , or ...?

I just checked, and the difference that -preserve makes in the old version of tidy (vers 4th August 2000) is fairly common; it tends to be things like angle brackets in email addresses. For example, the input source of:

...becomes either of (results compared via diff):

-

+

However, whether > is written literally or as &gt; is purely a detail of encoding, and I am willing to use tidy-devel without having the -preserve capability.

Although, now that I think about it, using &copy; rather than &#xA9; (I think?) is more portable: whether 0xA9 actually is the copyright symbol in the particular character set being used could be a problem. Isn't there one of UTF8 or ISO-8859-1 in which 0xA9 is not the copyright symbol? [ I ran into this issue using the W3C HTML validator as well. ]

A broader issue is whether tidy should generate a charset declaration (particularly when used with -xml/-asxml), and what it should pick if neither the user nor the source document specifies one. I think it would be useful for tidy to do so by default...

-- 
-Chuck
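
The 0xA9 question raised above can be checked directly. What follows is a minimal sketch in Python, which is not used anywhere in this thread; the helper name decodes_as_utf8 is made up for illustration. It suggests that the raw byte 0xA9 is the copyright sign in ISO-8859-1 but is not a valid character by itself in UTF-8, while the named and numeric character references identify the code point U+00A9 regardless of the document's charset.

```python
import html

COPYRIGHT = "\u00a9"  # U+00A9 COPYRIGHT SIGN

# ISO-8859-1 maps each byte value to the code point with the same number,
# so the copyright sign is the single byte 0xA9.
latin1_bytes = COPYRIGHT.encode("iso-8859-1")

# UTF-8 encodes U+00A9 as two bytes (0xC2 0xA9); a lone 0xA9 is only a
# continuation byte and cannot be decoded on its own.
utf8_bytes = COPYRIGHT.encode("utf-8")


def decodes_as_utf8(raw: bytes) -> bool:
    """Return True if raw is a valid UTF-8 byte sequence (hypothetical helper)."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Character references name a code point rather than a byte, so the named,
# hexadecimal, and decimal forms all resolve to the same character.
named = html.unescape("&copy;")
numeric_hex = html.unescape("&#xA9;")
numeric_dec = html.unescape("&#169;")

print(latin1_bytes, utf8_bytes, decodes_as_utf8(b"\xa9"))
print(named == numeric_hex == numeric_dec == COPYRIGHT)
```

If this reading is right, the portability concern applies to a literal 0xA9 byte written into a document whose charset is undeclared or misdeclared, which is also why an explicit charset declaration in tidy's output would help.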