Date:      Sun, 22 Feb 2004 14:26:00 -0500
From:      Chuck Swiger <chuck@pkix.net>
To:        Thierry Thomas <thierry@pompo.net>
Cc:        freebsd-doc@FreeBSD.org
Subject:   Re: Validating docbook articles...
Message-ID:  <40390248.1060104@pkix.net>
In-Reply-To: <20040222181114.GB32524@graf.pompo.net>
References:  <8D03FA54-4BA6-11D8-8D97-003065ABFD92@pkix.net> <20040216130659.GC617@submonkey.net> <4031364A.2070708@pkix.net> <20040222181114.GB32524@graf.pompo.net>

Thierry Thomas wrote:

>On Mon 16 Feb 04 at 22:29:46 +0100, Chuck Swiger <chuck@pkix.net>
> wrote:
>  
>
>>...tidy-devel doesn't understand the -preserve option.  Something like the 
>>following, as www/tidy-devel/files/patch-console-tidy.c:
>>    
>>
>
>Some days ago, we were speaking of this option (with Alex Dupre). It
>seems useful for documents encoded with charsets unsupported by Tidy.
>  
>

Hi, Thierry--

Thanks for your response and interest in the change I suggested.  I 
would be happy to spend more time on this issue and "do the right thing" 
rather than just turn that option into a null operation.  However, as 
you've noticed:

>There exist two possibilities:
>
>- we encode all documents in supported charsets (e.g. UTF8), and this
>option is not necessary (we can apply your patch to keep a compatibility
>with old scripts);
>
>- we have documents written in such encodings, and tidy-devel should be
>patched to actually preserve entities, or we have to keep the original
>Tidy.
>
Your latter comment suggests that the -preserve functionality in tidy is
no longer available in tidy-devel.  That matches my own experience: I
looked through the tidy-devel code for a comparable flag to set and did
not find one.  Maybe we should ask the author, <dsr@w3.org>, or
<html-tidy@w3.org>...?

I just checked, and the differences that -preserve makes in the old
version of tidy (vers 4th August 2000) are fairly common; they tend to
be things like angle brackets in email addresses.  For example, the
input source of:

<P CLASS="ADDRESS"><CODE CLASS="EMAIL">&#60;
<A HREF="mailto:chuck@pkix.net">chuck@pkix.net</A>&#62;</CODE></P>

...becomes either of the following (results compared via diff):

-<p class="ADDRESS"><code class="EMAIL">&lt;<a href=
-"mailto:chuck@pkix.net">chuck@pkix.net</a>&gt;</code></p>
+<p class="ADDRESS"><code class="EMAIL">&#60;<a href=
+"mailto:chuck@pkix.net">chuck@pkix.net</a>&#62;</code></p>

However, the usage of &gt; rather than &#62; is purely a detail of 
encoding, and I am willing to use tidy-devel without having the 
-preserve capability.
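
As a minimal sketch of why the two spellings are equivalent, the snippet
below decodes both forms and compares the results.  It uses Python's
standard html module purely for illustration; it is not part of tidy,
and the variable names are made up for this example.

```python
import html

# The same address written with named and with numeric character references.
named = "&lt;chuck@pkix.net&gt;"
numeric = "&#60;chuck@pkix.net&#62;"

# html.unescape() resolves both kinds of reference to the characters they denote.
decoded_named = html.unescape(named)
decoded_numeric = html.unescape(numeric)

# Both spellings decode to the identical string.
print(decoded_named == decoded_numeric)
print(decoded_named)
```

Any conforming HTML parser should treat the two inputs the same way, so
the choice between them only affects how the source text looks.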

Then again, now that I think about it, using &copy; rather than
&#xA9; (I think?) is more portable: whether 0xA9 actually is the
copyright symbol in the particular character set being used could be a
problem.  Isn't it the case that 0xA9 is not the copyright symbol in one
of UTF-8 or ISO-8859-1?  [ I ran into this issue using the W3C HTML
validator as well. ]
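
One way to check the 0xA9 question is to look at the bytes each encoding
produces for the copyright sign.  The sketch below again uses Python's
standard library only as a convenient test bench (nothing here comes
from tidy), and the variable names are illustrative.

```python
import html

# The named reference and the decimal and hex numeric references
# all resolve to the same code point, U+00A9.
copyright_sign = html.unescape("&copy;")
same_code_point = (
    copyright_sign == html.unescape("&#169;") == html.unescape("&#xA9;")
)

# The encoded bytes differ between the two charsets.
latin1_bytes = copyright_sign.encode("iso-8859-1")  # single byte 0xA9
utf8_bytes = copyright_sign.encode("utf-8")         # two bytes, 0xC2 0xA9

# A lone 0xA9 byte is not valid UTF-8 on its own.
try:
    b"\xa9".decode("utf-8")
    lone_byte_is_valid_utf8 = True
except UnicodeDecodeError:
    lone_byte_is_valid_utf8 = False

print(same_code_point, latin1_bytes, utf8_bytes, lone_byte_is_valid_utf8)
```

So the code point is the same in both cases, but a raw 0xA9 byte only
means the copyright sign when the document is read as ISO-8859-1; in
UTF-8 the sign is the two-byte sequence 0xC2 0xA9.  A character
reference sidesteps that difference because it names the code point
rather than a byte.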

A broader issue is whether tidy should generate a charset declaration
(particularly when used with -xml/-asxml), and what it should pick if
neither the user nor the source document specifies one.  I think it
would be useful for tidy to do so by default...

-- 
-Chuck


