From owner-freebsd-doc Thu Nov 15 3:34: 4 2001
Delivered-To: freebsd-doc@freebsd.org
Received: from dual.ms.mff.cuni.cz (www.freebsd.cz [195.113.19.84])
    by hub.freebsd.org (Postfix) with ESMTP id C017237B405
    for ; Thu, 15 Nov 2001 03:34:01 -0800 (PST)
Received: from localhost (horcicka@localhost)
    by dual.ms.mff.cuni.cz (8.11.3/8.11.1) with ESMTP id fAFBY0161654
    for ; Thu, 15 Nov 2001 12:34:00 +0100 (CET)
    (envelope-from horcicka@FreeBSD.cz)
Date: Thu, 15 Nov 2001 12:34:00 +0100 (CET)
From: Martin Horcicka
Subject: Why TIDY can never work correctly with ISO-8859-2 and others
Message-ID: <20011115105650.W57038-100000@dual.ms.mff.cuni.cz>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-doc@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

Hi,

Tidy simply cannot be used correctly with 8-bit character sets other
than Latin 1, because it does not support them.

Consider an HTML document in, for example, ISO-8859-2 encoding that
contains some Central European characters and a © entity (such as
&copy;).

The default behavior of Tidy (char-encoding: ascii) is to replace all
non-ASCII characters with character entities. It takes a Central
European character and encodes it as an entity with the same numeric
value, but that value is then interpreted (as defined by the HTML
specification) in ISO-8859-1 (that is, Unicode), so the entity names a
different character!

If you use char-encoding: latin1, the © entity is converted to a plain
character with the same value, but the document is in ISO-8859-2, where
that value means a different character!

And if you use char-encoding: raw, character entities with values above
255 are not printed as entities, which is really bad in 8-bit encodings.

In my opinion Tidy cannot be used for encodings it does not natively
support (i.e. for the Russian and Czech (still not in main CVS)
translations of pages and docs).

Martin
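
The first two cases described in the message can be made concrete with a minimal Python sketch. It only models the behavior as the message describes it and is not Tidy's own code. The byte 0xE8 and all names here (codepoint, naive_entity, aware_entity, copy_byte, seen_by_reader) are illustrative assumptions.

```python
import html


def codepoint(ch: str) -> str:
    """Format a single character as an ASCII-safe U+XXXX label."""
    return "U+%04X" % ord(ch)


# Byte 0xE8 is "c with caron" (U+010D) in ISO-8859-2.
byte = 0xE8

# char-encoding: ascii, as described in the message:
# the raw byte value is reused as the entity number.
naive_entity = "&#%d;" % byte

# What a charset-aware tool would have to do instead (assumption):
# decode the byte as ISO-8859-2 first, then emit the Unicode code point.
aware_entity = "&#%d;" % ord(bytes([byte]).decode("iso8859_2"))

# A browser resolves numeric entities as Unicode, so the naive entity
# names "e with grave" (U+00E8) rather than "c with caron".
print(naive_entity, codepoint(html.unescape(naive_entity)))  # &#232; U+00E8
print(aware_entity, codepoint(html.unescape(aware_entity)))  # &#269; U+010D

# char-encoding: latin1, as described in the message:
# &copy; is written out as the single Latin 1 byte 0xA9 ...
copy_byte = html.unescape("&copy;").encode("latin-1")

# ... but a reader treating the page as ISO-8859-2 sees
# "S with caron" (U+0160) in place of the copyright sign.
seen_by_reader = copy_byte.decode("iso8859_2")
print(copy_byte.hex(), codepoint(seen_by_reader))  # a9 U+0160
```

In both cases the numeric value survives unchanged while the character set used to interpret it changes, which is why the output names the wrong character.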