Date: Sun, 8 Dec 2013 11:20:59 +0900
From: Tomoaki AOKI <junchoon@dec.sakura.ne.jp>
To: freebsd-stable@freebsd.org
Subject: Problems in base iconv conversion
Message-ID: <20131208112059.f892398f0a22f58311684c2f@dec.sakura.ne.jp>
[-- Attachment #1 --]
Hi
In base iconv, some character sets have problems, mostly related to
single-byte JIS X 0201 Kana (aka half-width kana) and multi-byte JIS X
0213 (the 2004 version is the newest standard for now).
The problems fall into 3 error patterns:
1. Illegal byte sequence (in GNU iconv, "cannot convert")
2. Invalid argument (in GNU iconv, "unsupported")
3. Invalid characters
What I tried is:
*Select the Japanese codes from `iconv -l`. (I may have missed some.)
*Group them by whether they are single-byte or multi-byte.
*Convert a simple test string (meaningless as a sentence) from UTF-8
to the target code using `iconv -f UTF-8 -t (target) (TestString)`,
convert the result back, and compare the round-tripped string with
the original test string. If an error occurred, classify it and
record the output in hex form.
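The procedure above can be sketched roughly as follows. This is only an
illustration: it uses Python's built-in codecs in place of iconv, so the
results will not match base iconv or GNU iconv, and the function name
`round_trip` and the status labels are made up for this sketch.

```python
import codecs


def round_trip(text, target):
    """Encode text to target, decode it back, and classify the outcome.

    Returns (status, hex_of_encoded_bytes). The status strings only
    loosely mirror the iconv error patterns; they are illustrative.
    """
    try:
        codecs.lookup(target)
    except LookupError:
        # The codec name is not known at all.
        return ("unsupported", "")
    try:
        encoded = text.encode(target)
    except UnicodeEncodeError:
        # At least one character has no mapping in the target code.
        return ("cannot convert", "")
    try:
        decoded = encoded.decode(target)
    except UnicodeDecodeError:
        return ("illegal byte sequence", encoded.hex())
    status = "ok" if decoded == text else "mismatch"
    return (status, encoded.hex())


if __name__ == "__main__":
    # Half-width katakana A, I, U (U+FF71..U+FF73).
    sample = "\uff71\uff72\uff73"
    for target in ("shift_jis", "euc_jp", "iso2022_jp", "cp932"):
        print(target, round_trip(sample, target))
```

Recording the hex form of the encoded bytes, as in the attached PDF,
makes it possible to see whether a character was dropped or replaced.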
Please see the attached PDF for details (the notes are basically for
base iconv). I tested base iconv in stable/10 r258701, with GNU iconv
from ports on stable/9 for reference.
Strangely, although all targets are listed in `iconv -l` (base iconv;
not all of them are listed by ports GNU iconv), some targets fail with
"Invalid argument" and write nothing to stdout. This should not
happen: they should either be supported properly or dropped from the
list.
In the other error patterns, the output strings are converted
incorrectly. In some cases a character is dropped, or converted to the
substitute character used for errors (GETA MARK).
I should mention that mapping an unsupported character to GETA MARK is
normal treatment in the multi-byte case, because not every UTF-8
character is supported in every code. Dropping an unsupported
character is considered really abnormal in most cases.
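The difference between substituting GETA MARK and silently dropping a
character can be illustrated like this. It is a sketch using Python's
codecs, not iconv; the error-handler name "geta" and the helper
functions are hypothetical, and U+3013 is the Unicode code point of
GETA MARK.

```python
import codecs

GETA_MARK = "\u3013"  # GETA MARK, present in JIS X 0208


def _geta_handler(exc):
    """Replace each unencodable character with GETA MARK."""
    if isinstance(exc, UnicodeEncodeError):
        count = exc.end - exc.start
        return (GETA_MARK * count, exc.end)
    raise exc


# "geta" is a name chosen for this sketch, not a standard handler.
codecs.register_error("geta", _geta_handler)


def encode_with_geta(text, target):
    """Unsupported characters become GETA MARK (the normal treatment)."""
    return text.encode(target, errors="geta")


def encode_dropping(text, target):
    """Unsupported characters vanish (the abnormal treatment)."""
    return text.encode(target, errors="ignore")


if __name__ == "__main__":
    sample = "A\u2171B"  # U+2171 is not in plain JIS X 0208
    print(encode_with_geta(sample, "shift_jis").hex())
    print(encode_dropping(sample, "shift_jis").hex())
```

With substitution the reader can still see that something was there;
with dropping the loss is invisible in the output.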
Can someone confirm and fix? Looking in the src tree, the corresponding
csmapper sources seem to exist. But my knowledge of iconv internals is
insufficient, so I can't figure out why these errors occur.
I have no fix, sorry. (It's beyond my ability.)
Some technical notes:
In JIS X 0201 and its variants, half-width katakana characters are
supported directly in the 8-bit encoding and via shift-out/shift-in in
the 7-bit encoding.
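As a rough illustration of the two forms, the sketch below encodes
half-width katakana (U+FF61..U+FF9F) both ways. It is hand-written, not
taken from iconv or csmapper; the function names are made up, and it
treats the Roman half as plain ASCII, ignoring the JIS X 0201
differences at 0x5C and 0x7E.

```python
SO = 0x0E  # shift-out: following 7-bit bytes are katakana
SI = 0x0F  # shift-in: back to the Roman set


def _kana_byte(ch):
    """Return the 8-bit JIS X 0201 byte for a half-width kana, else None."""
    cp = ord(ch)
    if 0xFF61 <= cp <= 0xFF9F:
        return cp - 0xFEC0  # maps to 0xA1..0xDF
    return None


def encode_8bit(text):
    """8-bit form: katakana sit directly in 0xA1..0xDF."""
    out = bytearray()
    for ch in text:
        kana = _kana_byte(ch)
        if kana is not None:
            out.append(kana)
        elif ord(ch) < 0x80:
            out.append(ord(ch))
        else:
            raise ValueError("not representable: %r" % ch)
    return bytes(out)


def encode_7bit(text):
    """7-bit form: katakana are 0x21..0x5F between SO and SI."""
    out = bytearray()
    shifted = False
    for ch in text:
        kana = _kana_byte(ch)
        if kana is not None:
            if not shifted:
                out.append(SO)
                shifted = True
            out.append(kana - 0x80)
        elif ord(ch) < 0x80:
            if shifted:
                out.append(SI)
                shifted = False
            out.append(ord(ch))
        else:
            raise ValueError("not representable: %r" % ch)
    if shifted:
        out.append(SI)  # leave the stream in the Roman state
    return bytes(out)
```

For example, `encode_8bit("A\uff71B")` yields the bytes 41 B1 42, while
`encode_7bit("A\uff71B")` yields 41 0E 31 0F 42.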
JIS X 0212 is an extension to JIS X 0208, not a superset of it.
On the other hand, JIS X 0213 is a modified superset of JIS X 0208. (It
includes almost all of JIS X 0208 but is not compatible, as some code
points are changed, subsumed or split. In addition, many of the MS
extended characters are included.)
In strict EUC-JP (equal to the EUC variant of ISO2022-JP), half-width
katakana characters are intentionally unsupported, but EUC-JP itself
can support them in a 2-byte form led by 0x8E and followed by the
JIS X 0201 code.
In strict EUC-JP, JIS X 0212 extended characters are not supported
either, but EUC-JP itself can support them in a 3-byte form led by
0x8F.
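The lead bytes described above can be used to tell the code sets apart
in an EUC-JP byte string. The following is a minimal sketch, not the
iconv implementation; the function name and the labels are made up, and
trail bytes are not validated.

```python
def euc_jp_code_sets(data):
    """Return the code set of each character in an EUC-JP byte string."""
    sets = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            sets.append("ascii")
            i += 1
        elif lead == 0x8E:
            # 2-byte form: 0x8E + JIS X 0201 katakana byte
            sets.append("jisx0201-kana")
            i += 2
        elif lead == 0x8F:
            # 3-byte form: 0x8F + two bytes of JIS X 0212
            sets.append("jisx0212")
            i += 3
        elif 0xA1 <= lead <= 0xFE:
            # Ordinary 2-byte JIS X 0208 character
            sets.append("jisx0208")
            i += 2
        else:
            raise ValueError("unexpected lead byte 0x%02X" % lead)
    return sets


if __name__ == "__main__":
    # Python's own euc_jp codec emits the 0x8E form for half-width kana.
    encoded = "\uff71".encode("euc_jp")
    print(encoded.hex(), euc_jp_code_sets(encoded))
```

A "strict" converter in the sense above would reject the 0x8E and 0x8F
forms, while a full EUC-JP converter would accept them.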
In my multi-byte test string, the code point encoded as 0xE2 0x85 0xB1
in UTF-8 is vendor-specific in JIS X 0208 and JIS X 0208 + 0212 (equal
to ISO2022-JP-1 excluding half-width katakana characters), including
the SHIFT_JIS variants and EUC variants. Some of these vendor-specific
characters were introduced into the standard in JIS X 0213.
Vendor-specific variants such as CP932 already had them before
JIS X 0213.
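To make the last note concrete, the sketch below decodes those three
UTF-8 bytes and checks which codecs can represent the character. It
uses Python's codecs, not iconv, so it only shows the character-set
difference, not the behaviour of base iconv; `can_encode` is a helper
made up for this example.

```python
import unicodedata


def can_encode(ch, codec):
    """True if the codec has a mapping for the character."""
    try:
        ch.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


# The three bytes mentioned above decode to a single code point.
CHAR = bytes([0xE2, 0x85, 0xB1]).decode("utf-8")

if __name__ == "__main__":
    print("U+%04X" % ord(CHAR), unicodedata.name(CHAR))
    for codec in ("shift_jis", "cp932", "shift_jis_2004", "euc_jis_2004"):
        print(codec, can_encode(CHAR, codec))
```

The character is U+2171 (SMALL ROMAN NUMERAL TWO). Python's plain
`shift_jis` codec rejects it, while `cp932` accepts it.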
Regards.
--
Tomoaki AOKI junchoon@dec.sakura.ne.jp
[-- Attachment #2 --]