Date:      Wed, 08 Mar 2000 19:20:36 +0900
From:      Hiroki Sato <hrs@geocities.co.jp>
To:        ache@nagual.pp.ru, phantom@FreeBSD.ORG
Cc:        doc@FreeBSD.ORG
Subject:   Re: SGML->HTML: entities translation is broken for non-Latin1 charsets
Message-ID:  <200003081024.TAA24457@mail.geocities.co.jp>
In-Reply-To: <20000306130945.A92757@nagual.pp.ru>
References:  <20000306003545.A90564@nagual.pp.ru> <20000305151810.A200@scorpion.crimea.ua> <20000306130945.A92757@nagual.pp.ru> <20000305203633.A89852@nagual.pp.ru>

"Andrey A. Chernov" <ache@nagual.pp.ru> wrote
 in <20000306130945.A92757@nagual.pp.ru>:

> > I tried and it works as expected. I have commited fix 10 minutes ago. Please
> > report me any problems with russian web pages (keep in mind -- www server tree 
> > is rebuilding each 24 hours)
> 
> 24h not passed yet, so I'll wait. Meanwhile here is following suggestions:
> 1) Add &reg; &deg; &trade; to this list as commonly used ones.
> 2) Add this directives not to ru/includes.sgml only but to all
> */includes.sgml (i.e. en/ es/ ja/ etc.)

 There is another problem with using 8-bit characters in the FreeBSD
 Handbook and FAQ.  Both are processed by jade and then tidy, but tidy
 does not handle entities such as &copy; as expected.

 Tidy has a -raw option for 8-bit input, so Japanese-doc has to use
 it (in TIDYFLAGS).  However, with -raw tidy *always* replaces
 entities above 127, such as &copy;, with the corresponding 8-bit
 values.  When tidy reads an entity such as &copy;, it converts the
 entity internally into a character (#169 in the case of &copy;) and
 then writes it out again as &copy;.  The -raw option suppresses
 that last conversion for characters above 127, so &lt; is still
 output as the entity &lt;, but &nbsp; is output as the raw code #160.
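
 The round trip described above can be modelled in a few lines of
 Python.  This is only a sketch of the observed behaviour, not tidy's
 actual code; the function names and the raw flag are made up for
 illustration.

```python
import re
from html.entities import codepoint2name, name2codepoint

# Matches named (&copy;) and decimal numeric (&#169;) references.
ENTITY = re.compile(r"&(#\d+|[A-Za-z][A-Za-z0-9]*);")


def decode_entities(text):
    """Input side: turn every known entity into its character."""
    def repl(match):
        ref = match.group(1)
        if ref.startswith("#"):
            return chr(int(ref[1:]))
        codepoint = name2codepoint.get(ref)
        # Leave unknown names untouched.
        return chr(codepoint) if codepoint is not None else match.group(0)
    return ENTITY.sub(repl, text)


def encode_output(text, raw=False):
    """Output side: re-escape characters; raw=True skips those above 127."""
    out = []
    for ch in text:
        codepoint = ord(ch)
        if ch in "<>&":
            # Markup characters are always written as entities.
            out.append("&%s;" % codepoint2name[codepoint])
        elif codepoint > 127 and not raw:
            name = codepoint2name.get(codepoint)
            out.append("&%s;" % name if name else "&#%d;" % codepoint)
        else:
            # With raw=True, characters above 127 fall through unchanged.
            out.append(ch)
    return "".join(out)


def roundtrip(text, raw=False):
    return encode_output(decode_entities(text), raw=raw)


print(ascii(roundtrip("[&copy;&lt;]")))
print(ascii(roundtrip("[&copy;&lt;]", raw=True)))
```

 In this model the entity is lost at the decode step, so by the time
 the output stage runs there is no way to tell a character that was
 typed as 8-bit input from one that was written as an entity.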

 You can try the following commands to confirm it:

 % echo "[&copy;&lt;]" | tidy      | grep "\[" | hexdump -C
 % echo "[&copy;&lt;]" | tidy -raw | grep "\[" | hexdump -C

 The Russian FAQ also has this problem.  Build it with -DNO_TIDY and
 without, and compare the number of &nbsp; entities in book.html:

 % cd doc/ru_RU.KOI8-R/books/faq
 % make -DNO_TIDY; grep "&nbsp;" book.html | wc
 % make clean all; grep "&nbsp;" book.html | wc 
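
 The grep above only counts the entities that survived.  To see where
 the missing ones went, one could also count raw 0xA0 bytes in the
 same file.  A minimal sketch follows; count_nbsp is a hypothetical
 helper, and a 0xA0 byte may be a legitimate character in some
 encodings, so the second number is only a hint.

```python
def count_nbsp(data):
    """Return (number of &nbsp; entities, number of raw 0xA0 bytes)."""
    return data.count(b"&nbsp;"), data.count(b"\xa0")


# Small inline sample: one entity and one raw byte.
sample = b"<p>a&nbsp;b\xa0c</p>"
print(count_nbsp(sample))
```

 For a real build, read the file in binary mode, for example
 count_nbsp(open("book.html", "rb").read()), once after each of the
 two make runs.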

"Andrey A. Chernov" <ache@nagual.pp.ru> wrote
 in <20000305203633.A89852@nagual.pp.ru>:

> Alexey Zelkin just inform me that &nbsp; preserved in FAQ (good news!), so

 Really?  I can find raw #160 characters around the
 <p class="LITERALLAYOUT"> tag in
 http://www.freebsd.org/ru/FAQ/preface.html.

 This problem is unavoidable as long as we use the current version
 of tidy.  We can build the docs with the NO_TIDY flag as a temporary
 workaround (Japanese-doc actually does so now), but I personally
 don't think this is a reasonable solution.

 To tell the truth, Kuriyama-san already pointed this out and
 submitted a patch to fix it.  It seems that the tidy developers
 didn't consider it an important issue.

--
| Hiroki Sato/HRS <hrs@geocities.co.jp>
|
|                                  j7397067@ed.noda.sut.ac.jp(univ)
|                        hrs@jp.FreeBSD.org(FreeBSD doc-jp Project)


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-doc" in the body of the message



