Date: Sun, 17 Feb 2013 17:11:10 -0500
From: "J.R. Oldroyd"
To: freebsd-i18n@freebsd.org
Subject: locale/euc.c encoding modifying input characters
Message-ID: <20130217171110.5ff55687@shibato>

In the mbrtowc code in src/lib/libc/locale/euc.c, just before a
character is returned, there is this code:

	wc = (wc & ~CEI->mask) | CEI->bits[set];

which has the effect of ensuring that bits 0x8080 are on in a 2-byte
character, even if they were not on in the input.

Why is this code there?  It means that characters returned will be
different from the input data if the input data contains invalid
characters.

I think the code needs replacing with a validation that those bits
are set and an error return if not:

	if (wc != ((wc & ~CEI->mask) | CEI->bits[set])) {
		/* Invalid multibyte sequence */
		errno = EILSEQ;
		return ((size_t)-1);
	}

I'm asking for this change because if you read some data that is in
another locale while you're inadvertently set to euc, you want to
know you hit a data error rather than having some other data value
returned.
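
To make the difference concrete, here is a minimal standalone sketch
that puts the two behaviours side by side.  It is not the libc code:
struct euc_info_sketch, euc_force_bits() and euc_check_bits() are
hypothetical names made up for illustration, and the caller is assumed
to supply the mask and per-set bit values that the real code reads
through CEI.

```c
#include <errno.h>
#include <stddef.h>
#include <wchar.h>

/*
 * Hypothetical stand-in for the per-locale EUC data reached through
 * CEI in euc.c.  Only the two fields used by the quoted line are
 * modelled here; the real structure is not reproduced.
 */
struct euc_info_sketch {
	wchar_t	mask;		/* bits that identify the code set */
	wchar_t	bits[4];	/* expected value of those bits, per set */
};

/*
 * Current behaviour: unconditionally force the code-set bits, so an
 * input that lacked them is silently turned into another character.
 */
static wchar_t
euc_force_bits(const struct euc_info_sketch *cei, wchar_t wc, int set)
{
	return ((wc & ~cei->mask) | cei->bits[set]);
}

/*
 * Proposed behaviour: leave wc untouched and fail with EILSEQ when the
 * code-set bits are not what the selected set requires.  On success the
 * byte count passed in as len is returned, mimicking mbrtowc().
 */
static size_t
euc_check_bits(const struct euc_info_sketch *cei, wchar_t wc, int set,
    size_t len)
{
	if (wc != ((wc & ~cei->mask) | cei->bits[set])) {
		/* Invalid multibyte sequence */
		errno = EILSEQ;
		return ((size_t)-1);
	}
	return (len);
}
```

With an illustrative mask of 0x8080 and set 1 bits of 0x8080, the
2-byte input 0xB0 0x21 assembles to 0xB021.  euc_force_bits() returns
0xB0A1 for it, a different character, whereas euc_check_bits() returns
(size_t)-1 with errno set to EILSEQ.
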

For anyone looking, the code has been there since 2003, when the euc
locale had its own mbrtowc code added:

    Revision 121893
    Modified Sun Nov 2 10:09:33 2003 UTC (9 years, 3 months ago) by tjr

    Convert the Big5, EUC, MSKanji and UTF-8 encoding methods to
    implement mbrtowc() and wcrtomb() directly. GB18030, GBK and UTF2
    are left unconverted; GB18030 will be done eventually, but GBK and
    UTF2 may just be removed, as they are subsets of GB18030 and UTF-8
    respectively.

	-jr