From: Gabor Kovesdan
Date: Fri, 02 Sep 2011 04:08:35 +0200
To: freebsd-standards@FreeBSD.org, freebsd-i18n@FreeBSD.org
Cc: Andrey Chernov
Subject: POSIX regex VS. multi-byte characters
Message-ID: <4E603AA3.1040204@FreeBSD.org>

Hi Folks,

While working on bringing new regex code into FreeBSD, I ran into an issue.
POSIX says here:

http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09

"Matching shall be based on the bit pattern used for encoding the character, not on the graphic representation of the character. This means that if a character set contains two or more encodings for a graphic symbol, or if the strings searched contain text encoded in more than one codeset, no attempt is made to search for any other representation of the encoded symbol. If that is required, the user can specify equivalence classes containing all variations of the desired graphic symbol."

According to my interpretation of this text, if someone specifies as a pattern a single byte that can be the prefix of a multi-byte character, it shall match, since matching is based on the bit pattern and not on the semantic meaning. Besides, in a consistent environment that uses a single encoding, and assuming a user with common sense who would not enter meaningless input, only whole characters should occur in the pattern.

However, GNU grep has a test in its regression test suite that contradicts this and takes the opposite approach, i.e. a fragment of a character shall not match. Looking at the standard, I think GNU grep is incorrect and my interpretation is the correct one.

Could you please comment on this?

Thanks,

Gabor Kovesdan
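
For illustration only, the two readings discussed above can be sketched in a few lines of Python. This is not the code of GNU grep or of any FreeBSD regex implementation; it assumes UTF-8 as the encoding, treats the pattern as a literal byte string, and the function names are made up for this example.

```python
import re

def byte_level_match(pattern: bytes, data: bytes) -> bool:
    """Reading 1: match purely on the bit pattern, ignoring character boundaries."""
    return re.search(re.escape(pattern), data) is not None

def is_boundary(data: bytes, i: int) -> bool:
    """True if offset i does not fall inside a UTF-8 multi-byte sequence.

    UTF-8 continuation bytes have the form 10xxxxxx, so an offset is a
    character boundary if it is the end of the data or the byte found
    there is not a continuation byte.
    """
    return i == len(data) or (data[i] & 0xC0) != 0x80

def char_aligned_match(pattern: bytes, data: bytes) -> bool:
    """Reading 2: accept only occurrences that start and end on character boundaries."""
    i = data.find(pattern)
    while i != -1:
        if is_boundary(data, i) and is_boundary(data, i + len(pattern)):
            return True
        i = data.find(pattern, i + 1)  # try the next occurrence
    return False

text = "\u00e9".encode("utf-8")  # e with acute accent: bytes 0xC3 0xA9
fragment = b"\xc3"               # lead byte only, not a whole character

print(byte_level_match(fragment, text))    # fragment matches under reading 1
print(char_aligned_match(fragment, text))  # fragment is rejected under reading 2
print(char_aligned_match(text, text))      # a whole character matches either way
```

Under the first reading the lone lead byte matches the two-byte character; under the second it does not, because the match would end in the middle of the character.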