Date: Mon, 6 Mar 2000 02:14:55 +0000
From: Nik Clayton
To: "Andrey A. Chernov"
Cc: doc@freebsd.org, www@freebsd.org, phantom@freebsd.org, ru@freebsd.org
Subject: Re: SGML->HTML: entities translation is broken for non-Latin1 charsets
Message-ID: <20000306021454.A87062@catkin.nothing-going-on.org>
In-Reply-To: <20000304134300.A24194@nagual.pp.ru>

On Sat, Mar 04, 2000 at 01:43:02PM +0300, Andrey A. Chernov wrote:
> Looking at www.freebsd.org I found that sgml->html procedure replace
> things like &nbsp; &copy; etc. with their Latin1 8bit hardcoded values :-(

This is done by sgmlnorm. Last time this issue came up I didn't have a good fix for it either. I spoke to the OpenJade maintainers then, and got a reply back from Matthias Clasen, who said:

> sgmlnorm is not designed to do what you request.

which might be true, but doesn't really help us.

I've done some more digging. I don't have the necessary skills to fix this myself, but perhaps the following will point someone in the right direction.

First off, sgmlnorm is part of Jade, and it's written in C++, which complicates things mightily. I'm no C++ programmer, so I'm extrapolating from my C and Perl knowledge here.
If you look in jade/style/sdata.h, you'll see a table that maps character codes to entity names. This is the root cause of the problem. A typical line from that file is

    { 0x00A9, "copy" },

which is why "&copy;" becomes "\a9" when a file is processed by sgmlnorm.

This file is used in jade/style/Interpreter.cxx to build an array of structs, in this piece of code:

--
void Interpreter::installSdata()
{
  // This comes from uni2sgml.txt on ftp://unicode.org.
  // It is marked there as obsolete, so it probably ought to be checked.
  // The definitions of apos and quot have been fixed for consistency with XML.
  static struct {
    Char c;
    const char *name;
  } entities[] = {
#include "sdata.h"
  };
  for (size_t i = 0; i < SIZEOF(entities); i++)
    sdataEntityNameTable_.insert(makeStringC(entities[i].name), entities[i].c);
}
--

I assume that's building a lookup table, to map entity names to their corresponding character codes.

The only other place sdataEntityNameTable_ is used is in the Interpreter::sdataMap method. That function is passed the entity name and a reference to a character to output, and alters the reference as necessary, based upon the sdataEntityNameTable_ map. The logic seems to be:

  1. If the entity name is in sdataEntityNameTable_, look up its replacement (e.g., "\a9") and return.

  2. If it's not there, call convertUnicodeCharName() on it. This is also defined in Interpreter.cxx, and is a simple switch().

  3. If that step failed, return defaultChar, which seems to be 0xfffd.

Most of the time, step (1) is going to succeed.

As you can see, this code is designed to convert entity names to their numeric references (actually, to C++ chars). A quick glance at the surrounding and calling code shows that the assumption that the reference passed to sdataMap is a single character is deeply embedded. Changing it will probably touch quite a lot of code.
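For anyone who wants to see those three steps in one place, here is a rough, self-contained sketch of the lookup as I read it. It is not the Jade code: the two-entry table, the names kEntities, sameName, convertUnicodeCharNameSketch and sdataMapSketch, the unsigned int stand-in for Char, and the "U-XXXX" name format accepted in step 2 are all assumptions made up for illustration.

```cpp
#include <cstddef>

typedef unsigned int Char;  // stand-in for Jade's Char type (assumption)

struct SdataEntry {
  Char c;
  const char *name;
};

// Hypothetical two-entry excerpt in the style of sdata.h.
constexpr SdataEntry kEntities[] = {
  { 0x00A0, "nbsp" },
  { 0x00A9, "copy" },
};

constexpr Char kDefaultChar = 0xFFFD;  // the fallback value mentioned above

// True when two NUL-terminated names are identical.
constexpr bool sameName(const char *a, const char *b) {
  while (*a != '\0' && *a == *b) {
    ++a;
    ++b;
  }
  return *a == *b;
}

// Stand-in for convertUnicodeCharName(). Assumed, for illustration only,
// to accept names of the form "U-XXXX" with four uppercase hex digits.
constexpr bool convertUnicodeCharNameSketch(const char *name, Char &c) {
  if (name[0] != 'U' || name[1] != '-')
    return false;
  Char value = 0;
  for (std::size_t i = 2; i < 6; ++i) {
    const char ch = name[i];
    if (ch >= '0' && ch <= '9')
      value = value * 16 + static_cast<Char>(ch - '0');
    else if (ch >= 'A' && ch <= 'F')
      value = value * 16 + static_cast<Char>(ch - 'A' + 10);
    else
      return false;  // also covers names that end too early
  }
  if (name[6] != '\0')
    return false;  // trailing characters after the four digits
  c = value;
  return true;
}

// The three steps: table lookup, name conversion, then the default.
constexpr Char sdataMapSketch(const char *name) {
  for (std::size_t i = 0; i < sizeof(kEntities) / sizeof(kEntities[0]); ++i)
    if (sameName(kEntities[i].name, name))
      return kEntities[i].c;                  // step 1: found in the table
  Char c = 0;
  if (convertUnicodeCharNameSketch(name, c))
    return c;                                 // step 2: converted from the name
  return kDefaultChar;                        // step 3: give up
}
```

Note that the sketch can only ever hand back one Char. That is the single-character assumption described above; getting something like "&copy;" back out would need a different return type, which is where the amount of code to touch starts to grow.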
Working backwards, the single character (Char c_) is defined in the SdataNode class (a subclass of EntityRefNode) in spgrove/GroveBuilder.cxx. The single character is private to the class, and can only be accessed through the SdataNode::charChunk method. A quick grep through the source tree shows lots of calls to charChunk() :-(

After that, I get a bit lost. I haven't got the tools here to hold a full class hierarchy in my head. . .

But that's a start, if anyone wants to do some digging.

N
-- 
Internet connection, $19.95 a month. Computer, $799.95. Modem, $149.95.
Telephone line, $24.95 a month. Software, free. USENET transmission,
hundreds if not thousands of dollars. Thinking before posting, priceless.
Somethings in life you can't buy. For everything else, there's MasterCard.
  -- Graham Reed, in the Scary Devil Monastery