Date:      Sun, 17 Feb 2013 17:11:10 -0500
From:      "J.R. Oldroyd" <fbsd@opal.com>
To:        freebsd-i18n@freebsd.org
Subject:   locale/euc.c encoding modifying input characters
Message-ID:  <20130217171110.5ff55687@shibato>

In the mbrtowc code in src/lib/libc/locale/euc.c, just before a character
is returned, there is this code:

	wc = (wc & ~CEI->mask) | CEI->bits[set];

which has the effect of ensuring that bits 0x8080 are on in a 2-byte
character, even if they were not on in the input.
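To make the effect concrete, here is a small standalone sketch of that
expression. The name force_set_bits and the 0x8080 values are
illustrative assumptions standing in for CEI->mask and CEI->bits[set]
in the 2-byte case; they are not read from any particular locale
definition.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for the statement in euc.c:
 *     wc = (wc & ~CEI->mask) | CEI->bits[set];
 * "mask" plays the role of CEI->mask, "bits" of CEI->bits[set].
 */
static uint32_t
force_set_bits(uint32_t wc, uint32_t mask, uint32_t bits)
{
	/* Clear every bit covered by the mask, then turn on the set's bits. */
	return ((wc & ~mask) | bits);
}
```

With mask and bits both 0x8080, a well-formed 0xA4A2 passes through
unchanged, but 0xA422 (second byte missing its high bit) also comes
out as 0xA4A2, so the caller cannot tell the two inputs apart.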

Why is this code there?  It means that invalid input is silently
rewritten: the character returned differs from the bytes that were read.

I think the code needs replacing with a validation that those bits are
set and an error return if not:

	if (wc != ((wc & ~CEI->mask) | CEI->bits[set])) {
		/* Invalid multibyte sequence */
		errno = EILSEQ;
		return ((size_t)-1);
	}
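As a standalone sketch of that check (not a patch against euc.c):
check_set_bits is a hypothetical helper, "len" stands for the byte
count that mbrtowc would otherwise return, and the simpler comparison
used here is equivalent to the one above only under the assumption
that bits never has a bit set outside mask.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: reject a character whose masked bits do not
 * already match the bits expected for its code set, instead of
 * forcing them on.  Assumes (bits & ~mask) == 0.
 */
static size_t
check_set_bits(uint32_t wc, uint32_t mask, uint32_t bits, size_t len)
{
	if ((wc & mask) != bits) {
		/* Invalid multibyte sequence */
		errno = EILSEQ;
		return ((size_t)-1);
	}
	/* Bits were already correct; report the bytes consumed. */
	return (len);
}
```

With mask and bits both 0x8080, 0xA4A2 is accepted and 0xA422 is
rejected with EILSEQ.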

I'm asking for this change because if you read data that is in another
encoding while your locale is inadvertently set to euc, you want to be
told that you hit a data error rather than get some other value back.

For anyone looking, the code has been there since 2003, when the euc
locale had its own mbrtowc code added:

Revision 121893
Modified Sun Nov 2 10:09:33 2003 UTC by tjr
Previous revision: 101566
Convert the Big5, EUC, MSKanji and UTF-8 encoding methods to implement
mbrtowc() and wcrtomb() directly. GB18030, GBK and UTF2 are left
unconverted; GB18030 will be done eventually, but GBK and UTF2 may just
be removed, as they are subsets of GB18030 and UTF-8 respectively.

	-jr


