Date: Sun, 17 Feb 2013 17:11:10 -0500
From: "J.R. Oldroyd" <fbsd@opal.com>
To: freebsd-i18n@freebsd.org
Subject: locale/euc.c encoding modifying input characters
Message-ID: <20130217171110.5ff55687@shibato>
In the mbrtowc code in src/lib/libc/locale/euc.c, just before a
character is returned, there is this code:

    wc = (wc & ~CEI->mask) | CEI->bits[set];

which has the effect of ensuring that bits 0x8080 are on in a 2-byte
character, even if they were not on in the input.

Why is this code there? It means that the characters returned will
differ from the input data if the input data contains invalid
characters.

I think the code needs replacing with a validation that those bits
are set, and an error return if they are not:

    if (wc != ((wc & ~CEI->mask) | CEI->bits[set])) {
        /* Invalid multibyte sequence */
        errno = EILSEQ;
        return ((size_t)-1);
    }

I'm asking for this change because if you read data that is in
another encoding while your locale is inadvertently set to euc, you
want to know that you hit a data error rather than have some other
data value returned.

For anyone looking, the code has been there since 2002 when the euc
locale had its own mbrtowc code added:

    Revision 121893
    Modified Sun Nov 2 10:09:33 2003 UTC (9 years, 3 months ago) by tjr
    File length: 5611 byte(s)
    Diff to previous 101566

    Convert the Big5, EUC, MSKanji and UTF-8 encoding methods to
    implement mbrtowc() and wcrtomb() directly. GB18030, GBK and UTF2
    are left unconverted; GB18030 will be done eventually, but GBK
    and UTF2 may just be removed, as they are subsets of GB18030 and
    UTF-8 respectively.

-jr
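To make the difference between the two behaviours concrete, here is a minimal standalone sketch. It is not the libc code: the struct euc_info type, the helper names euc_fixup() and euc_check(), and the mask/bits values (chosen to resemble an eucJP-style layout) are all assumptions made for illustration only.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical stand-in for the per-locale EUC data that CEI points at.
 * The field names follow the snippet in the message; the values below
 * are assumed for illustration and are not read from a real locale.
 */
struct euc_info {
	unsigned long bits[4];	/* bits expected for each code set */
	unsigned long mask;	/* bits that identify the code set */
};

static const struct euc_info example_cei = {
	{ 0x0000, 0x8080, 0x0080, 0x8000 },
	0x8080
};

/* Existing behaviour: silently force the code set's bits on. */
static unsigned long
euc_fixup(const struct euc_info *ei, unsigned long wc, int set)
{
	assert(set >= 0 && set < 4);
	return ((wc & ~ei->mask) | ei->bits[set]);
}

/*
 * Proposed behaviour: leave wc untouched and fail with EILSEQ when
 * the bits do not already match. Returns 0 if valid, -1 otherwise.
 */
static int
euc_check(const struct euc_info *ei, unsigned long wc, int set)
{
	assert(set >= 0 && set < 4);
	if (wc != ((wc & ~ei->mask) | ei->bits[set])) {
		/* Invalid multibyte sequence */
		errno = EILSEQ;
		return (-1);
	}
	return (0);
}
```

With these assumed values, a 2-byte input of 0xA4 0x22 (second byte missing its high bit) assembles to 0xA422. euc_fixup() turns that into 0xA4A2, a different and valid-looking character, while euc_check() rejects it with EILSEQ and accepts a well-formed 0xA4A2.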