Date: Fri, 09 Jul 2004 10:32:57 +0900
From: NAKAJI Hiroyuki <nakaji@jp.freebsd.org>
To: Alexander Nedotsukov <bland@FreeBSD.org>
Cc: gnome@FreeBSD.org
Subject: Re: converters/libiconv change request for net/samba3
Message-ID: <87zn6a2cdy.fsf@roddy.acest.tutrp.tut.ac.jp>
In-Reply-To: <40EB9607.6020906@FreeBSD.org> (Alexander Nedotsukov's message of "Wed, 07 Jul 2004 15:19:51 +0900")
References: <87acyd8zg0.fsf@roddy.acest.tutrp.tut.ac.jp>
    <40EA57EB.4060607@FreeBSD.org>
    <871xjp8sim.fsf@roddy.acest.tutrp.tut.ac.jp>
    <87fz84lfaw.fsf@roddy.acest.tutrp.tut.ac.jp>
    <40EB9607.6020906@FreeBSD.org>

Mr. Iijima advised me again on ports-jp@jp.freebsd.org, but he is too shy
to post to gnome@freebsd.org.

<cite>
> If Microsoft called some hacked Shift_JIS version Shift_JIS
> it doesn't make it valid for the rest of the world.

Absolutely right, in a context where the JIS-Unicode mapping matters.
In such a context, we cannot, and will not, call CP932 Shift_JIS.

But in practice, we have traditionally treated official Shift_JIS,
Microsoft CP932, the Apple Japanese set, Sun Java's "SJIS" encoding, etc.
as identical to each other, as long as we do not use the extra characters
added by each vendor (for example, codepoints around 0x85??-0x87??).

Historically, JIS X0208 was born in 1978, and soon the 'shift' encoding
(what we now call Shift_JIS) and the 8th-bit-on encoding (what we now call
EUC) were invented. Of course there was no Unicode at that time, and JIS
did not describe precisely how each symbol is used. For instance,
Shift_JIS 0x8166 (which all vendors now map to U+2019 RIGHT SINGLE
QUOTATION MARK) played two roles: right quotation mark and apostrophe.
And that is despite the fact that we now have U+FF07 FULLWIDTH APOSTROPHE!

It was only in 1997 that JIS X0208 was revised to identify each symbol by
an English name and a Unicode codepoint, and by then it was too late.
I heard that a draft was proposed in 1994, but virtually nobody knew
about it. So each vendor designed its own JIS-Unicode mapping in the way
it thought best. Microsoft CP932 and the Apple Japanese set are a few
examples of such mappings, with extra characters added.

# Apple Japanese set is available at:
# http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/JAPANESE.TXT

See Shift_JIS codepoint 0x81CA. Most tables map this symbol to U+00AC
NOT SIGN, but Microsoft maps it to U+FFE2, to avoid a conflict with the
OS/2 single-byte NOT SIGN.

> For instance you can not have 1:1 mapping between cp932 and eucJP.

You are right, if you apply the rule strictly.
So we do not employ Unicode; instead we map CP932 or Shift_JIS characters
to EUC-JP with this simple calculation, with no conversion table:

   Either MS, Apple, or Shift_JIS     EUC-JP
   -----------------------------------------------
   0x00-0x7F (ASCII)               -> 0x00-0x7F
   0xA1-0xDF (JIS X0201 Katakana)  -> 0x8EA1-0x8EDF
   0x8140-0x819E (except 0x817F)   -> 0xA1A1-0xA1FE
   0x819F-0x81FC                   -> 0xA2A1-0xA2FE
   0x8240-0x829E (ditto)           -> 0xA3A1-0xA3FE
   0x829F-0x82FC                   -> 0xA4A1-0xA4FE
     :
   0x9F9F-0x9FFC                   -> 0xDEA1-0xDEFE
   0xE040-0xE09E (ditto)           -> 0xDFA1-0xDFFE
     :
   0xEF9F-0xEFFC                   -> 0xFEA1-0xFEFE
   0x80 and 0xFD-0xFF (Apple only) -> not supported

   (CP932 only)
   0xF040-0xF9FC (private area)    -> not supported
   0xFA40-0xFC4B (IBM extension)   -> few converters support them,
                                      but there are two ways:

   (a) every character here has a duplicate within the range
       0x8140-0xEFFC (namely 0x87?? and 0xED40-0xEEFC) for historical
       reasons, so it can be unified with its counterpart, though this
       breaks round-trip conversion.

   (b) most characters here (perhaps all) are included in JIS X0212
       (specified by 'ESC $ ( D' in 7-bit encodings and prefixed by 0x8F
       in EUC-JP), so you can convert them to X0212 characters if your
       applications support X0212.

Extra characters added by Microsoft or Apple are mapped to undefined
EUC-JP codepoints, but we either use a font that supports such extras or
eliminate these characters entirely.
</cite>

>>>>> Alexander Nedotsukov <bland@FreeBSD.org> wrote:

> Btw, are you guys pretty sure you problem comes form libiconv? I have
> few japanese windows workstations here and if you like can check what's
> wrong with them. Just give me a simple instructions how to reproduce a
> problem in this case. Why I asking because I already saw false reports
> about libiconv problems when people tried to convert windows client
> encoding to samba's host encoding and this is not always possible. For
> instance you can not have 1:1 mapping between cp932 and eucJP.

And MORIYAMA Masayuki at MIRACLE LINUX CORPORATION showed me an example
using vim6.
Because he sent it to me in Japanese and I translated it, there may be
mistakes in my translation.

<cite>
Steps to reproduce the wrong mapping:

1. Install libiconv-1.9.1 with the patch that adds the modified cp932 and
   eucJP-ms.
   http://www2d.biglobe.ne.jp/~msyk/software/libiconv/libiconv-1.9.1-cp932.patch.gz

2. Install vim6 according to the explanation in
   http://pcmania.jp/~moraz/howto/install.html (written in Japanese)

3. Configure your ~/.vimrc
-----------
set encoding=japan
if has('iconv')
  set fileencodings+=iso-2022-jp
  set fileencodings+=utf-8,ucs-2le,ucs-2
  if &encoding ==# 'euc-jp'
    set fileencodings+=cp932
  else
    set fileencodings+=euc-jp
  endif
endif
-----------

4. Run vim and open the Shift_JIS file, tmp.txt:
-----------
日本語〜
\~
-----------

5. You can see that "〜" is not displayed correctly. But this is the
   expected result of the RIGHT modified mapping of cp932 in libiconv.
   You have to change your ~/.vimrc to use sjis instead of cp932:
-----------
set encoding=japan
if has('iconv')
  set fileencodings+=iso-2022-jp
  set fileencodings+=utf-8,ucs-2le,ucs-2
  if &encoding ==# 'euc-jp'
    set fileencodings+=sjis
  else
    set fileencodings+=euc-jp
  endif
endif
-----------

6. Open tmp.txt again, and then you can see the right contents.

7. After executing the command ":w!" in vim, you will get an error:
-----------
"tmp.txt"
"tmp.txt" E513: write error, conversion failed
Hit ENTER or type command to continue
-----------

Note: this is because converting 0x5C and 0x7E in euc-jp to 0x5C and 0x7E
in Shift_JIS, respectively, is impossible with the original libiconv
implementation. For example,

  $ echo '\~' | /usr/local/bin/iconv -f euc-jp -t sjis
  iconv: (stdin): cannot convert

About the "0x5C and 0x7E in Shift_JIS" problem, this page will be
helpful:
http://www.debian.or.jp/~kubota/unicode-symbols-yen.html.en

The patch was made to solve the problem described above:
http://www2d.biglobe.ne.jp/~msyk/software/libiconv/libiconv-1.9.1-cp932-jis.patch.gz
</cite>

Additional information from MORIYAMA-san:

<cite>
Because the problem with libiconv + vim6 is not related to Samba 3.0,
there is no problem in using Samba 3.0 with
http://www2d.biglobe.ne.jp/~msyk/software/libiconv/libiconv-1.9.1-cp932.patch.gz
which does not have the JIS hack.

The libiconv-1.9.1-cp932-jis.patch.gz must have the same conversion rules
as "The conversion table of characters which has less harm in JIS-Unicode
conversion":
http://hp.vector.co.jp/authors/VA010341/unicode/ (Japanese)
</cite>

Finally, in my experience:

1. pkg_delete -f ja-samba-2.2.9-ja-1.0

2. (cd /usr/ports/net/samba3; make install ALWAYS_BUILD_DEPENDS=yes)

3. Configure /usr/local/etc/smb.conf
---------
[global]
    dos charset = CP932
    unix charset = EUCJP-MS
    display charset = CP932
    netbios name = SAMBA3
    ; and so on
[homes]
    comment = %U's Home Directories
    valid users = %S
    read only = No
    nt acl support = No
    browseable = No
---------

4. Browse "\\SAMBA3\nakaji"; the Japanese filename in euc-jp is not
   displayed correctly.
   http://www.rc.tutrp.tut.ac.jp/~nakaji/tmp/libiconv-NG.JPG

5. With the patched libiconv and the same smb.conf, it is displayed
   correctly.
   http://www.rc.tutrp.tut.ac.jp/~nakaji/tmp/libiconv-OK.JPG

Thanks.
-- 
NAKAJI Hiroyuki
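
The table-free Shift_JIS to EUC-JP calculation quoted from Mr. Iijima
earlier in this message can be sketched in a few lines. This is only an
illustration of the byte arithmetic implied by that range table, not code
from libiconv, Samba, or either patch; the function name sjis_to_eucjp is
hypothetical, and the unsupported ranges (0x80, 0xF0-0xFF) are simply
rejected here as an assumption.

```python
def sjis_to_eucjp(data: bytes) -> bytes:
    """Convert Shift_JIS/CP932 bytes to EUC-JP by arithmetic only.

    Hypothetical helper: it follows the range table quoted in this message
    and raises ValueError for the ranges listed there as "not supported".
    """
    out = bytearray()
    i = 0
    while i < len(data):
        b1 = data[i]
        if b1 <= 0x7F:
            # ASCII passes through unchanged.
            out.append(b1)
            i += 1
        elif 0xA1 <= b1 <= 0xDF:
            # JIS X0201 Katakana: same byte, prefixed by 0x8E in EUC-JP.
            out += bytes((0x8E, b1))
            i += 1
        elif 0x81 <= b1 <= 0x9F or 0xE0 <= b1 <= 0xEF:
            if i + 1 >= len(data):
                raise ValueError("truncated two-byte sequence")
            b2 = data[i + 1]
            # Each Shift_JIS lead byte covers two consecutive EUC-JP rows.
            row = (b1 - (0x81 if b1 <= 0x9F else 0xC1)) * 2
            if 0x40 <= b2 <= 0x7E:
                cell = b2 - 0x40
            elif 0x80 <= b2 <= 0x9E:
                cell = b2 - 0x41          # 0x7F is skipped in Shift_JIS
            elif 0x9F <= b2 <= 0xFC:
                row += 1                  # second row of this lead byte
                cell = b2 - 0x9F
            else:
                raise ValueError(f"invalid trail byte 0x{b2:02X}")
            out += bytes((0xA1 + row, 0xA1 + cell))
            i += 2
        else:
            # 0x80, 0xA0, and 0xF0-0xFF (private area, IBM extension,
            # Apple-only bytes) are outside this sketch.
            raise ValueError(f"unsupported lead byte 0x{b1:02X}")
    return bytes(out)
```

For instance, sjis_to_eucjp(b"\x81\x40") gives b"\xa1\xa1", the first
entry of the table. Because no table is consulted, vendor extras such as
0x87?? are converted blindly to EUC-JP codepoints that the standard leaves
undefined, which matches the behaviour described in the quoted text.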