Date: Fri, 2 Sep 2011 08:03:38 +0200
From: Wolfgang Zenker
To: Gabor Kovesdan
Cc: Andrey Chernov, freebsd-standards@freebsd.org, freebsd-i18n@freebsd.org
Subject: Re: POSIX regex VS. multi-byte characters
Message-ID: <20110902060338.GA8192@lyxys.ka.sub.org>
References: <4E603AA3.1040204@FreeBSD.org>
In-Reply-To: <4E603AA3.1040204@FreeBSD.org>

Hi Gabor,

* Gabor Kovesdan [110902 04:08]:
> While working on bringing in a new regex code to FreeBSD, I came into an
> issue. POSIX says here:
> http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09
> "Matching shall be based on the bit pattern used for encoding the
> character, not on the graphic representation of the character. This
> means that if a character set contains two or more encodings for a
> graphic symbol, or if the strings searched contain text encoded in more
> than one codeset, no attempt is made to search for any other
> representation of the encoded symbol. If that is required, the user can
> specify equivalence classes containing all variations of the desired
> graphic symbol."
> According to my interpretation of this text, if someone specifies a
> single bit as pattern that can be a prefix of a multi-byte character
> that shall match, since match is based on bit pattern not semantical
> meaning. Besides, in a consistent environment that uses a single
> encoding and also supposing a user with common sense that would not
> enter meaningless input, only whole characters should occur in the
> pattern. However, GNU grep has a test in its regression test suite that
> contradicts to this and chooses the opposite approach, i.e. it shall not
> match a fragment of a character. Looking at the standard, I think GNU
> grep is incorrect and my interpretation is the correct one.

I think you are misinterpreting the standard here. As I read it, the
phrase "bit pattern used for encoding the character" means the complete
byte sequence that encodes the character, not just a byte.
The paragraph quoted above talks about characters that have several
different encodings, e.g. characters that exist as a single codepoint
but can also be encoded using a base character and diacritical marks.

Wolfgang
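
To make the two readings concrete, here is a minimal sketch in Python,
assuming UTF-8 as the encoding. Python is used purely for illustration;
this is not the FreeBSD regex code or GNU grep's implementation, and the
helper names (byte_level_match, whole_char_match, nfc_match) are
hypothetical. The first helper matches any byte fragment, the second
only accepts matches that cover complete encoded characters, and the
third shows the "several encodings for one graphic symbol" case by
normalizing both sides first, which goes beyond what the quoted POSIX
text requires.

```python
import unicodedata


def byte_level_match(pattern: bytes, text: str) -> bool:
    """Reading 1: any byte fragment matches, even inside a character."""
    return pattern in text.encode("utf-8")


def char_boundaries(data: bytes) -> set:
    """Offsets where a UTF-8 character starts, plus the end offset."""
    # Continuation bytes have the form 10xxxxxx; every other byte
    # begins a new character.
    starts = {i for i, b in enumerate(data) if (b & 0xC0) != 0x80}
    return starts | {len(data)}


def whole_char_match(pattern: bytes, text: str) -> bool:
    """Reading 2: a match must begin and end on character boundaries."""
    data = text.encode("utf-8")
    bounds = char_boundaries(data)
    start = data.find(pattern)
    while start != -1:
        if start in bounds and start + len(pattern) in bounds:
            return True
        start = data.find(pattern, start + 1)
    return False


def nfc_match(pattern: str, text: str) -> bool:
    """Compare after NFC normalization (an extra step, not POSIX)."""
    p = unicodedata.normalize("NFC", pattern).encode("utf-8")
    return whole_char_match(p, unicodedata.normalize("NFC", text))


precomposed = "caf\u00e9"   # e-acute as one codepoint, bytes C3 A9
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent

lead_byte_found = byte_level_match(b"\xc3", precomposed)
lead_byte_as_char = whole_char_match(b"\xc3", precomposed)
other_encoding_found = whole_char_match("\u00e9".encode("utf-8"), decomposed)
```

With these inputs, the lone lead byte C3 is found by the byte-level
search but rejected by the whole-character search, and the precomposed
pattern does not match the decomposed text unless both are normalized.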