From owner-freebsd-i18n@FreeBSD.ORG  Fri Sep  2 02:26:10 2011
Return-Path: <owner-freebsd-i18n@FreeBSD.ORG>
Delivered-To: freebsd-i18n@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 1E146106566B;
	Fri,  2 Sep 2011 02:26:10 +0000 (UTC)
	(envelope-from gabor@FreeBSD.org)
Received: from server.mypc.hu (server.mypc.hu [87.229.73.95])
	by mx1.freebsd.org (Postfix) with ESMTP id A32588FC14;
	Fri,  2 Sep 2011 02:26:09 +0000 (UTC)
Received: from server.mypc.hu (localhost [127.0.0.1])
	by server.mypc.hu (Postfix) with ESMTP id 77AE014E5C9E;
	Fri,  2 Sep 2011 04:08:41 +0200 (CEST)
X-Virus-Scanned: amavisd-new at server.mypc.hu
Received: from server.mypc.hu ([127.0.0.1])
	by server.mypc.hu (server.mypc.hu [127.0.0.1]) (amavisd-new, port 10024)
	with LMTP id YWxInx5JKUWk; Fri,  2 Sep 2011 04:08:39 +0200 (CEST)
Received: from [192.168.1.106] (catv-80-98-232-12.catv.broadband.hu
	[80.98.232.12])
	(using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits))
	(No client certificate requested)
	by server.mypc.hu (Postfix) with ESMTPSA id 0452214DB679;
	Fri,  2 Sep 2011 04:08:38 +0200 (CEST)
Message-ID: <4E603AA3.1040204@FreeBSD.org>
Date: Fri, 02 Sep 2011 04:08:35 +0200
From: Gabor Kovesdan <gabor@FreeBSD.org>
User-Agent: Mozilla/5.0 (Windows NT 5.1;
	rv:9.0a1) Gecko/20110822 Thunderbird/9.0a1
MIME-Version: 1.0
To: freebsd-standards@FreeBSD.org, freebsd-i18n@FreeBSD.org
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: Andrey Chernov <ache@nagual.pp.ru>
Subject: POSIX regex VS. multi-byte characters
X-BeenThere: freebsd-i18n@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: FreeBSD Internationalization Effort <freebsd-i18n.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-i18n>,
	<mailto:freebsd-i18n-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-i18n>
List-Post: <mailto:freebsd-i18n@freebsd.org>
List-Help: <mailto:freebsd-i18n-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-i18n>,
	<mailto:freebsd-i18n-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 02 Sep 2011 02:26:10 -0000

Hi Folks,

While working on bringing new regex code into FreeBSD, I ran into an
issue. POSIX says here:
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09

"Matching shall be based on the bit pattern used for encoding the 
character, not on the graphic representation of the character. This 
means that if a character set contains two or more encodings for a 
graphic symbol, or if the strings searched contain text encoded in more 
than one codeset, no attempt is made to search for any other 
representation of the encoded symbol. If that is required, the user can 
specify equivalence classes containing all variations of the desired 
graphic symbol."

According to my interpretation of this text, if someone specifies as the
pattern a single byte that can be a prefix of a multi-byte character,
that shall match, since matching is based on the bit pattern, not on
semantic meaning. Besides, in a consistent environment that uses a
single encoding, and assuming a sensible user who does not enter
meaningless input, only whole characters should occur in the pattern.
However, GNU grep has a test in its regression test suite that
contradicts this and takes the opposite approach, i.e. a pattern shall
not match a fragment of a character. Looking at the standard, I think
GNU grep is incorrect and my interpretation is the correct one.
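
To make the two readings concrete, here is a minimal sketch in Python. It
is only an illustration under stated assumptions: the text is UTF-8, and
the function names bytewise_find and charwise_find are hypothetical, not
taken from any regex library or from GNU grep.

```python
def bytewise_find(haystack: bytes, needle: bytes) -> int:
    """Offset of the first raw byte match, ignoring character boundaries."""
    return haystack.find(needle)


def charwise_find(haystack: bytes, needle: bytes) -> int:
    """Offset of the first match that starts and ends on a UTF-8
    character boundary, or -1 if there is none (assumes valid UTF-8)."""

    def is_boundary(pos: int) -> bool:
        # End of buffer, or a byte that is not a continuation byte (10xxxxxx).
        return pos == len(haystack) or (haystack[pos] & 0xC0) != 0x80

    start = haystack.find(needle)
    while start != -1:
        if is_boundary(start) and is_boundary(start + len(needle)):
            return start
        start = haystack.find(needle, start + 1)
    return -1


text = "\u00e9".encode("utf-8")   # e with acute accent: b'\xc3\xa9'
lead_byte = b"\xc3"               # first byte of that character only

print(bytewise_find(text, lead_byte))   # 0: the fragment matches
print(charwise_find(text, lead_byte))   # -1: fragments never match
```

The first function corresponds to the reading proposed above, the second
to the behaviour the GNU grep test expects.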

Could you please comment on this?

Thanks,
Gabor Kovesdan


From owner-freebsd-i18n@FreeBSD.ORG  Fri Sep  2 06:37:57 2011
Return-Path: <owner-freebsd-i18n@FreeBSD.ORG>
Delivered-To: freebsd-i18n@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 8E0EA106564A;
	Fri,  2 Sep 2011 06:37:57 +0000 (UTC)
	(envelope-from wolfgang@lyxys.ka.sub.org)
Received: from saturn.lyxys.ka.sub.org (saturn.lyxys.ka.sub.org
	[217.29.35.151])
	by mx1.freebsd.org (Postfix) with ESMTP id DCDBE8FC12;
	Fri,  2 Sep 2011 06:37:56 +0000 (UTC)
Received: from juno.lyxys.ka.sub.org (juno.lyx
	[IPv6:fd2a:89ca:7d54:0:20f:feff:fe0e:7312])
	by saturn.lyxys.ka.sub.org (8.14.2/8.14.2) with ESMTP id p8263edm006884;
	Fri, 2 Sep 2011 08:03:40 +0200 (CEST)
	(envelope-from wolfgang@lyxys.ka.sub.org)
Received: from juno.lyxys.ka.sub.org (localhost [127.0.0.1])
	by juno.lyxys.ka.sub.org (8.14.5/8.14.5) with ESMTP id p8263d6P008359; 
	Fri, 2 Sep 2011 08:03:39 +0200 (CEST)
	(envelope-from wolfgang@lyxys.ka.sub.org)
Received: (from wolfgang@localhost)
	by juno.lyxys.ka.sub.org (8.14.5/8.14.5/Submit) id p8263dIl008358;
	Fri, 2 Sep 2011 08:03:39 +0200 (CEST)
	(envelope-from wolfgang@lyxys.ka.sub.org)
X-Authentication-Warning: juno.lyx: wolfgang set sender to
	wolfgang@lyxys.ka.sub.org using -f
Date: Fri, 2 Sep 2011 08:03:38 +0200
From: Wolfgang Zenker <wolfgang@lyxys.ka.sub.org>
To: Gabor Kovesdan <gabor@freebsd.org>
Message-ID: <20110902060338.GA8192@lyxys.ka.sub.org>
References: <4E603AA3.1040204@FreeBSD.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <4E603AA3.1040204@FreeBSD.org>
User-Agent: Mutt/1.4.2.3i
Organization: private site
Cc: Andrey Chernov <ache@nagual.pp.ru>, freebsd-standards@freebsd.org,
	freebsd-i18n@freebsd.org
Subject: Re: POSIX regex VS. multi-byte characters
X-BeenThere: freebsd-i18n@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: FreeBSD Internationalization Effort <freebsd-i18n.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-i18n>,
	<mailto:freebsd-i18n-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-i18n>
List-Post: <mailto:freebsd-i18n@freebsd.org>
List-Help: <mailto:freebsd-i18n-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-i18n>,
	<mailto:freebsd-i18n-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 02 Sep 2011 06:37:57 -0000

Hi Gabor,

* Gabor Kovesdan <gabor@freebsd.org> [110902 04:08]:
> While working on bringing new regex code into FreeBSD, I ran into an
> issue. POSIX says here:
> http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09

> "Matching shall be based on the bit pattern used for encoding the 
> character, not on the graphic representation of the character. This 
> means that if a character set contains two or more encodings for a 
> graphic symbol, or if the strings searched contain text encoded in more 
> than one codeset, no attempt is made to search for any other 
> representation of the encoded symbol. If that is required, the user can 
> specify equivalence classes containing all variations of the desired 
> graphic symbol."

> According to my interpretation of this text, if someone specifies as the
> pattern a single byte that can be a prefix of a multi-byte character,
> that shall match, since matching is based on the bit pattern, not on
> semantic meaning. Besides, in a consistent environment that uses a
> single encoding, and assuming a sensible user who does not enter
> meaningless input, only whole characters should occur in the pattern.
> However, GNU grep has a test in its regression test suite that
> contradicts this and takes the opposite approach, i.e. a pattern shall
> not match a fragment of a character. Looking at the standard, I think
> GNU grep is incorrect and my interpretation is the correct one.

I think you are misinterpreting the standard here. As I read it, the
phrase "bit pattern used for encoding the character" means the complete
byte sequence that encodes the character, not just a single byte. The
paragraph quoted above is about characters that have several different
encodings, e.g. characters that exist as a single code point but can
also be encoded as a base character plus diacritical marks.
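
As an illustration of that case, the following sketch shows one graphic
symbol with two encodings. It uses Python's re module, which is not a
POSIX regex implementation; the alternation only stands in for the
equivalence class the standard mentions, and the helper name matches is
hypothetical.

```python
import re

precomposed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

# Same graphic symbol, two different bit patterns.
print(precomposed.encode("utf-8"))   # b'\xc3\xa9'
print(decomposed.encode("utf-8"))    # b'e\xcc\x81'


def matches(pattern: str, text: str) -> bool:
    """True if the pattern is found anywhere in the text."""
    return re.search(pattern, text) is not None


# A pattern for one encoding does not find the other one.
exact = re.escape(precomposed)
print(matches(exact, "caf" + precomposed))   # True
print(matches(exact, "caf" + decomposed))    # False

# The user has to list every variation explicitly to match both.
either = "(?:" + re.escape(precomposed) + "|" + re.escape(decomposed) + ")"
print(matches(either, "caf" + decomposed))   # True
```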

Wolfgang

From owner-freebsd-i18n@FreeBSD.ORG  Fri Sep  2 09:16:42 2011
Return-Path: <owner-freebsd-i18n@FreeBSD.ORG>
Delivered-To: freebsd-i18n@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 5B066106564A;
	Fri,  2 Sep 2011 09:16:42 +0000 (UTC) (envelope-from ache@vniz.net)
Received: from vniz.net (vniz.net [194.87.13.69])
	by mx1.freebsd.org (Postfix) with ESMTP id B7C668FC0C;
	Fri,  2 Sep 2011 09:16:41 +0000 (UTC)
Received: from localhost (localhost [127.0.0.1])
	by vniz.net (8.14.5/8.14.5) with ESMTP id p828wF01091148;
	Fri, 2 Sep 2011 12:58:15 +0400 (MSK) (envelope-from ache@vniz.net)
Received: (from ache@localhost)
	by localhost (8.14.5/8.14.5/Submit) id p828wFcQ091147;
	Fri, 2 Sep 2011 12:58:15 +0400 (MSK) (envelope-from ache)
Date: Fri, 2 Sep 2011 12:58:14 +0400
From: Andrey Chernov <ache@freebsd.org>
To: Wolfgang Zenker <wolfgang@lyxys.ka.sub.org>
Message-ID: <20110902085814.GA90871@vniz.net>
Mail-Followup-To: Andrey Chernov <ache@freebsd.org>,
	Wolfgang Zenker <wolfgang@lyxys.ka.sub.org>,
	Gabor Kovesdan <gabor@freebsd.org>, freebsd-standards@freebsd.org,
	freebsd-i18n@freebsd.org
References: <4E603AA3.1040204@FreeBSD.org>
	<20110902060338.GA8192@lyxys.ka.sub.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20110902060338.GA8192@lyxys.ka.sub.org>
User-Agent: Mutt/1.5.21 (2010-09-15)
Cc: freebsd-standards@freebsd.org, Gabor Kovesdan <gabor@freebsd.org>,
	freebsd-i18n@freebsd.org
Subject: Re: POSIX regex VS. multi-byte characters
X-BeenThere: freebsd-i18n@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: FreeBSD Internationalization Effort <freebsd-i18n.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-i18n>,
	<mailto:freebsd-i18n-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-i18n>
List-Post: <mailto:freebsd-i18n@freebsd.org>
List-Help: <mailto:freebsd-i18n-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-i18n>,
	<mailto:freebsd-i18n-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 02 Sep 2011 09:16:42 -0000

On Fri, Sep 02, 2011 at 08:03:38AM +0200, Wolfgang Zenker wrote:
> Hi Gabor,
> 
> * Gabor Kovesdan <gabor@freebsd.org> [110902 04:08]:
> > While working on bringing new regex code into FreeBSD, I ran into an
> > issue. POSIX says here:
> > http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09
> 
> > "Matching shall be based on the bit pattern used for encoding the 
> > character, not on the graphic representation of the character. This 
> > means that if a character set contains two or more encodings for a 
> > graphic symbol, or if the strings searched contain text encoded in more 
> > than one codeset, no attempt is made to search for any other 
> > representation of the encoded symbol. If that is required, the user can 
> > specify equivalence classes containing all variations of the desired 
> > graphic symbol."
> 
> > According to my interpretation of this text, if someone specifies as the
> > pattern a single byte that can be a prefix of a multi-byte character,
> > that shall match, since matching is based on the bit pattern, not on
> > semantic meaning. Besides, in a consistent environment that uses a
> > single encoding, and assuming a sensible user who does not enter
> > meaningless input, only whole characters should occur in the pattern.
> > However, GNU grep has a test in its regression test suite that
> > contradicts this and takes the opposite approach, i.e. a pattern shall
> > not match a fragment of a character. Looking at the standard, I think
> > GNU grep is incorrect and my interpretation is the correct one.
> 
> I think you are misinterpreting the standard here. As I read it, the
> phrase "bit pattern used for encoding the character" means the complete
> byte sequence that encodes the character, not just a single byte. The
> paragraph quoted above is about characters that have several different
> encodings, e.g. characters that exist as a single code point but can
> also be encoded as a base character plus diacritical marks.

1) That is how I read it, too. "Bit pattern" means the complete pattern,
not a partial one. POSIX does not provide for matching partial or
fragmented characters; all characters there are always complete and
monolithic.

2) The overall intent is this: e.g. the Russian 'a' should not match the
graphically identical English 'a' within a given character set such as
KOI8-R or Unicode.

3) Meaningless input should not match anything meaningful, so only one
question remains: should meaningless input match exactly the same
meaningless input, or should it exit with an error? POSIX grep says
nothing; POSIX regexec() says no more than:
"The regcomp( ) and regexec( ) functions are required to accept any 
null-terminated string as the pattern argument. If the meaning of the 
string is 'undefined', the behavior of the function is 'unspecified'."
Currently GNU grep matches meaningless input against exactly the same
bytes in the file. A fragment of a character (an incomplete one) is
meaningless input, so I don't see where GNU grep takes the opposite
approach.
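
Points 2) and 3) can be sketched in a few lines of Python. This is only
an illustration: the helper names encodings and is_complete are
hypothetical, and treating "does not decode" as "meaningless input" is an
assumption of the sketch, not something POSIX defines.

```python
latin_a = "a"            # U+0061 LATIN SMALL LETTER A
cyrillic_a = "\u0430"    # U+0430 CYRILLIC SMALL LETTER A, looks the same


def encodings(ch: str) -> dict:
    """Byte sequences of one character in KOI8-R and in UTF-8."""
    return {name: ch.encode(name) for name in ("koi8_r", "utf-8")}


def is_complete(pattern: bytes, codeset: str = "utf-8") -> bool:
    """True if the bytes decode as whole characters in the codeset."""
    try:
        pattern.decode(codeset)
        return True
    except UnicodeDecodeError:
        return False


# Point 2: same glyph, different bit patterns, so no match is expected.
print(encodings(latin_a))      # {'koi8_r': b'a', 'utf-8': b'a'}
print(encodings(cyrillic_a))   # {'koi8_r': b'\xc1', 'utf-8': b'\xd0\xb0'}

# Point 3: a lone lead byte is not a complete character in UTF-8.
print(is_complete(b"\xd0\xb0"))   # True
print(is_complete(b"\xd0"))       # False
```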
 
-- 
http://ache.vniz.net/