Date: Wed, 08 Mar 2000 19:20:36 +0900
From: Hiroki Sato <hrs@geocities.co.jp>
To: ache@nagual.pp.ru, phantom@FreeBSD.ORG
Cc: doc@FreeBSD.ORG
Subject: Re: SGML->HTML: entities translation is broken for non-Latin1 charsets
Message-ID: <200003081024.TAA24457@mail.geocities.co.jp>
In-Reply-To: <20000306130945.A92757@nagual.pp.ru>
References: <20000306003545.A90564@nagual.pp.ru> <20000305151810.A200@scorpion.crimea.ua> <20000306130945.A92757@nagual.pp.ru> <20000305203633.A89852@nagual.pp.ru>

"Andrey A. Chernov" <ache@nagual.pp.ru> wrote
  in <20000306130945.A92757@nagual.pp.ru>:

> > I tried and it works as expected. I have commited fix 10 minutes ago. Please
> > report me any problems with russian web pages (keep in mind -- www server tree
> > is rebuilding each 24 hours)
>
> 24h not passed yet, so I'll wait. Meanwhile here is following suggestions:
> 1) Add &reg; &deg; &trade; to this list as commonly used ones.
> 2) Add this directives not to ru/includes.sgml only but to all
> */includes.sgml (i.e. en/ es/ ja/ etc.)

There is another problem with using 8-bit codes in the FreeBSD Handbook/FAQ.
Both are processed by jade and tidy, but tidy does not handle entities
such as &copy; as expected.

Tidy has a -raw option for input of 8-bit characters, so Japanese-doc has
to use it (in TIDYFLAGS). However, with this option tidy *always* replaces
entities above 127, such as &copy;, with the corresponding 8-bit values.
When tidy reads an entity such as &copy;, it converts the entity internally
into character #169 (in the case of &copy;) and then outputs it in the form
&copy; again. The -raw option suppresses that last conversion for characters
above 127, so &lt; is output as the entity &lt;, but &nbsp; is output as the
raw code #160. You can try the following commands to confirm it:

  % echo "[&copy;&lt;]" | tidy | grep "\[" | hexdump -C
  % echo "[&copy;&lt;]" | tidy -raw | grep "\[" | hexdump -C

The Russian FAQ also has this problem. Try building with -DNO_TIDY and
compare the number of &nbsp; in book.html, like this:

  % cd doc/ru_RU.KOI8-R/books/faq
  % make -DNO_TIDY; grep "&nbsp;" book.html | wc
  % make clean all; grep "&nbsp;" book.html | wc

"Andrey A. Chernov" <ache@nagual.pp.ru> wrote
  in <20000305203633.A89852@nagual.pp.ru>:

> Alexey Zelkin just inform me that &nbsp; preserved in FAQ (good news!), so

Really? I can find raw #160 characters around the <p class="LITERALLAYOUT">
tag in http://www.freebsd.org/ru/FAQ/preface.html. This problem is
unavoidable as long as we use the current version of tidy.
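
For anyone comparing the two builds, it may help to count the &nbsp; entity
and the raw byte separately rather than rely on grep | wc, which counts
matching lines. The following is only a sketch: count_nbsp is a made-up
helper name, not part of the doc build, and it assumes that byte 0xA0 (#160)
does not occur in the file for any other reason, which may not hold for
every charset.

```shell
#!/bin/sh
# Hypothetical helper (not part of the FreeBSD doc tools): report how many
# "&nbsp;" entities and how many raw 0xA0 (#160) bytes a file contains.
count_nbsp() {
    file=$1

    # Entities that survived as text. -o prints one match per line, so
    # wc -l counts occurrences; -a and LC_ALL=C keep grep from treating
    # 8-bit input as a binary file.
    entities=$(LC_ALL=C grep -a -o '&nbsp;' "$file" | wc -l | tr -d ' ')

    # Raw bytes: delete everything except octal 240 (decimal 160) and
    # count what is left.
    raw=$(LC_ALL=C tr -cd '\240' < "$file" | wc -c | tr -d ' ')

    echo "entities=$entities raw=$raw"
}
```

Running count_nbsp book.html after each of the two builds should show
whether the entities have moved from the first count to the second.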

We can build the docs with the NO_TIDY flag to avoid the problem for now
(Japanese-doc actually does so at present), but I personally don't think
this is a reasonable solution.

To tell the truth, Kuriyama-san pointed this out before and submitted a
patch to fix it, but it seemed that the tidy developers did not consider it
an important issue.

-- 
| Hiroki Sato/HRS <hrs@geocities.co.jp>
| | j7397067@ed.noda.sut.ac.jp(univ)
| hrs@jp.FreeBSD.org(FreeBSD doc-jp Project)

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-doc" in the body of the message