Date:      Fri, 2 Sep 2011 08:03:38 +0200
From:      Wolfgang Zenker <wolfgang@lyxys.ka.sub.org>
To:        Gabor Kovesdan <gabor@freebsd.org>
Cc:        Andrey Chernov <ache@nagual.pp.ru>, freebsd-standards@freebsd.org, freebsd-i18n@freebsd.org
Subject:   Re: POSIX regex VS. multi-byte characters
Message-ID:  <20110902060338.GA8192@lyxys.ka.sub.org>
In-Reply-To: <4E603AA3.1040204@FreeBSD.org>
References:  <4E603AA3.1040204@FreeBSD.org>

Hi Gabor,

* Gabor Kovesdan <gabor@freebsd.org> [110902 04:08]:
> While working on bringing in a new regex code to FreeBSD, I came into an 
> issue. POSIX says here: 
> http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09

> "Matching shall be based on the bit pattern used for encoding the 
> character, not on the graphic representation of the character. This 
> means that if a character set contains two or more encodings for a 
> graphic symbol, or if the strings searched contain text encoded in more 
> than one codeset, no attempt is made to search for any other 
> representation of the encoded symbol. If that is required, the user can 
> specify equivalence classes containing all variations of the desired 
> graphic symbol."

> According to my interpretation of this text, if someone specifies a 
> single bit as pattern that can be a prefix of a multi-byte character 
> that shall match, since match is based on bit pattern not semantical 
> meaning. Besides, in a consistent environment that uses a single 
> encoding and also supposing a user with common sense that would not 
> enter meaningless input, only whole characters should occur in the 
> pattern. However, GNU grep has a test in its regression test suite that 
> contradicts to this and chooses the opposite approach, i.e. it shall not 
> match a fragment of a character. Looking at the standard, I think GNU 
> grep is incorrect and my interpretation is the correct one.

I think you are misinterpreting the standard here. As I read it, the
phrase "bit pattern used for encoding the character" means the complete
byte sequence that encodes the character, not just a single byte of it.
The paragraph quoted above is about characters that have several
different encodings, e.g. characters that exist as a single code point
but can also be encoded as a base character followed by combining
diacritical marks.
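
To make the distinction concrete, here is a minimal sketch in Python. The
choice of "é" in UTF-8 and all variable and function names are my own
assumptions for illustration; nothing here comes from the standard or
from any regex implementation. It shows two encodings of one graphic
symbol that differ as bit patterns, and a lone lead byte that is not a
complete encoding of any character.

```python
import unicodedata

# Two representations of the same graphic symbol (illustrative choice).
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

pre_bytes = precomposed.encode("utf-8")   # two bytes: 0xC3 0xA9
dec_bytes = decomposed.encode("utf-8")    # three bytes: 0x65 0xCC 0x81

# Matching on the bit pattern: the byte sequences differ, so a search
# for one representation does not find the other.
same_bits = pre_bytes == dec_bytes

# Normalization is used here only to show that both are the same graphic
# symbol; it is not something the quoted text asks a matcher to do.
same_symbol = unicodedata.normalize("NFC", decomposed) == precomposed


def is_complete_utf8(data: bytes) -> bool:
    """Return True if data decodes as whole UTF-8 characters (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The first byte of the precomposed form is only a fragment of a character.
lead_byte = pre_bytes[:1]

print(same_bits, same_symbol)
print(is_complete_utf8(pre_bytes), is_complete_utf8(lead_byte))
```

Under my reading, the "bit pattern" in the standard corresponds to
pre_bytes or dec_bytes as a whole, never to lead_byte on its own.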

Wolfgang


