From owner-freebsd-doc Wed Mar 8 2:24:36 2000
Delivered-To: freebsd-doc@freebsd.org
Received: from sv01.geocities.co.jp (sv01.geocities.co.jp [210.153.89.155])
    by hub.freebsd.org (Postfix) with ESMTP id 985C937B919;
    Wed, 8 Mar 2000 02:24:32 -0800 (PST)
    (envelope-from hrs@geocities.co.jp)
Received: from mail.geocities.co.jp (mail.geocities.co.jp [210.153.89.137])
    by sv01.geocities.co.jp (8.9.3+3.2W/3.7W) with ESMTP id TAA27164;
    Wed, 8 Mar 2000 19:24:19 +0900 (JST)
Received: from mail.hrs.jp (sutnmax2-ppp23.ed.noda.sut.ac.jp [133.31.173.93])
    by mail.geocities.co.jp (1.3G-GeocitiesJ-3.3) with ESMTP id TAA24457;
    Wed, 8 Mar 2000 19:24:16 +0900 (JST)
Message-Id: <200003081024.TAA24457@mail.geocities.co.jp>
Received: from localhost (alph.hrs.jp [192.168.0.10])
    by mail.hrs.jp (8.9.3/3.7W/DomainMaster) with ESMTP id TAA20149;
    Wed, 8 Mar 2000 19:20:37 +0900 (JST)
    (envelope-from hrs@hrs.jp)
To: ache@nagual.pp.ru, phantom@FreeBSD.ORG
Cc: doc@FreeBSD.ORG
Subject: Re: SGML->HTML: entities translation is broken for non-Latin1 charsets
In-Reply-To: <20000306130945.A92757@nagual.pp.ru>
References: <20000306003545.A90564@nagual.pp.ru>
    <20000305151810.A200@scorpion.crimea.ua>
    <20000306130945.A92757@nagual.pp.ru>
    <20000305203633.A89852@nagual.pp.ru>
X-Mailer: Mew version 1.94 on Emacs 19.34 / Mule 2.3 (SUETSUMUHANA)
Mime-Version: 1.0
Content-Type: Text/Plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Date: Wed, 08 Mar 2000 19:20:36 +0900
From: Hiroki Sato
X-Dispatcher: imput version 990905(IM130)
Lines: 59
Sender: owner-freebsd-doc@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

"Andrey A. Chernov" wrote in <20000306130945.A92757@nagual.pp.ru>:

> > I tried and it works as expected. I have commited fix 10 minutes ago. Please
> > report me any problems with russian web pages (keep in mind -- www server tree
> > is rebuilding each 24 hours)
>
> 24h not passed yet, so I'll wait. Meanwhile here is following suggestions:
> 1) Add &reg; &deg; &trade; to this list as commonly used ones.
> 2) Add this directives not to ru/includes.sgml only but to all
>    */includes.sgml (i.e. en/ es/ ja/ etc.)

There is another problem with using 8bit code in the FreeBSD
Handbook/FAQ. Both of them are processed by jade and tidy, but tidy
cannot handle entities like &copy; as expected.

Tidy has a -raw option for input of 8bit characters, so Japanese-doc
has to use it (in TIDYFLAGS). However, this *always* replaces >127
entities like &copy; with the corresponding 8bit values.

When tidy reads an entity such as &copy;, the entity is converted
internally into character #169 (in the case of &copy;), then output
in the form of &copy; again. The -raw option suppresses that last
conversion for >127 characters, so &lt; is output as the entity &lt;,
but &nbsp; is output as raw code #160. You can try the following
commands to confirm it:

 % echo "[&copy;&lt;]" | tidy | grep "\[" | hexdump -C
 % echo "[&copy;&lt;]" | tidy -raw | grep "\[" | hexdump -C

The Russian FAQ also has this problem. Try building with -DNO_TIDY
and check the number of &nbsp; in book.html like this:

 % cd doc/ru_RU.KOI8-R/books/faq
 % make -DNO_TIDY; grep "&nbsp;" book.html | wc
 % make clean all; grep "&nbsp;" book.html | wc

"Andrey A. Chernov" wrote in <20000305203633.A89852@nagual.pp.ru>:

> Alexey Zelkin just inform me that &nbsp; preserved in FAQ (good news!), so

Really? I can find raw #160 characters around a tag in
http://www.freebsd.org/ru/FAQ/preface.html.

This problem is unavoidable as long as we use the current version of
tidy. We can build the docs with the NO_TIDY flag to avoid the problem
for the time being (Japanese-doc actually does so now), but I
personally don't think this is a reasonable way.

To tell the truth, Kuriyama-san pointed this out and submitted a patch
to fix it before. It seemed that the tidy developers didn't think it
an important issue.

-- 
| Hiroki Sato/HRS
|
| j7397067@ed.noda.sut.ac.jp(univ)
| hrs@jp.FreeBSD.org(FreeBSD doc-jp Project)
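
The book.html comparison described in the message can be made a little more direct by counting both the entity references and the raw #160 (0xA0) bytes in one pass. The following is only a sketch: count_nbsp is a hypothetical helper name, not part of the doc build, and it assumes an encoding in which byte 0xA0 does not otherwise occur in ordinary text (in UTF-8, for example, 0xA0 can appear inside multibyte sequences, so the raw count would be misleading there).

```shell
#!/bin/sh
# count_nbsp FILE
# Hypothetical helper (not part of the FreeBSD doc tree): report how many
# "&nbsp;" entity references and how many raw 0xA0 bytes FILE contains.
count_nbsp() {
    file=$1
    # grep -o prints every match on its own line, so wc -l counts occurrences.
    entities=$(grep -o '&nbsp;' "$file" | wc -l | tr -d ' ')
    # In the C locale tr works on single bytes; \240 is octal for 0xA0 (#160).
    # -cd deletes every byte except 0xA0, so wc -c counts the raw bytes left.
    raw=$(LC_ALL=C tr -cd '\240' < "$file" | wc -c | tr -d ' ')
    echo "entities=$entities raw=$raw"
}
```

Running `count_nbsp book.html` once after a NO_TIDY build and once after a normal build should show whether entity references have been turned into raw bytes between the two.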